Abstract


This thesis develops online algorithms that can be used to solve a wide
variety of NP-hard problems more efficiently in practice. The common
approach taken by all our online algorithms is to improve the performance of
one or more existing algorithms for a specific NP-hard problem by adapting
the algorithms to the sequence of problem instance(s) they are run on.

We begin by presenting an algorithm for solving a specific class of online
resource allocation problems. Our online algorithm can be applied in
environments where abstract jobs arrive one at a time, and one can complete the
jobs by investing time in a number of abstract activities. Provided the jobs and
activities satisfy certain technical conditions, our online algorithm is
guaranteed to perform almost as well as any fixed schedule for investing time in
the various activities, according to two natural measures of performance: (i)
the average time required to complete each job, and (ii) the number of jobs
completed within time T, for some fixed deadline T > 0.

In  particular,  our  online  algorithm’s  guarantees  apply  if  the  job  can  be 
written  as  a  monotone,  submodular  function  of  a  set  of  pairs  of  the  form  (v,  r), 
where  r  is  the  time  invested  in  activity  v.  Under  the  first  objective,  the  offline 
version  of  this  problem  generalizes  Min-Sum  Set  Cover  and  the  related 
Pipelined  Set  Cover  problem.  Under  the  second  objective,  the  offline 
version  of  this  problem  generalizes  the  problem  of  maximizing  a  monotone, 
submodular set function subject to a knapsack constraint. Our online
algorithm has potential applications in a number of areas, including the design of
algorithm  portfolios,  database  query  processing,  and  sensor  placement. 

We  apply  this  online  algorithm  to  the  following  problem.  We  are  given 
k  algorithms,  and  are  fed,  one  at  a  time,  a  sequence  of  problem  instances  to 
solve.  We  may  solve  each  instance  using  any  of  the  k  algorithms,  we  may 
interleave the execution of the algorithms, and, if the algorithms are
randomized, we may periodically restart them with a fresh random seed. Our goal is
to minimize the total CPU time required to solve all the instances. Using data
from eight recent solver competitions, we show that our online algorithm and
its offline counterpart can be used to improve the performance of
state-of-the-art solvers in a number of problem domains, including Boolean satisfiability,
zero-one  integer  programming,  constraint  satisfaction,  and  theorem  proving. 


We next present an online algorithm that can be used to improve the
performance of algorithms that solve an optimization problem by making a sequence
of calls to a decision procedure that answers questions of the form “Is there a
solution of cost at most k?” We present an adaptive strategy for determining
the  sequence  of  questions  to  ask,  along  with  bounds  on  the  maximum  time  to 
spend  waiting  for  an  answer  to  each  question.  Under  the  assumption  that  the 
time  required  by  the  decision  procedure  to  return  an  answer  increases  as  k  gets 
closer  to  the  optimal  solution  cost,  our  strategy’s  performance  is  near-optimal 
when  measured  in  terms  of  a  natural  competitive  ratio.  Experimentally,  we 
show  that  applying  our  strategy  to  recent  algorithms  for  A.I.  planning  and  job 
shop  scheduling  allows  the  algorithms  to  find  approximately  optimal  solutions 
more  quickly. 

Lastly, we develop algorithms for solving the max k-armed bandit problem,
a variant of the classical k-armed bandit problem in which one seeks to
maximize the highest payoff received on any single trial, rather than the
cumulative payoff. A strategy for solving the max k-armed bandit problem can
be used to allocate trials among multi-start optimization heuristics. Motivated
by results in extreme value theory, we present a no-regret strategy for the
special case in which each arm returns payoffs drawn from a generalized extreme
value distribution. We also present a heuristic strategy that solves the max
k-armed bandit problem using a strategy for the classical k-armed bandit
problem as a subroutine. Experimentally, we show that our max k-armed bandit
strategy can be used to effectively allocate trials among multi-start heuristics
for the RCPSP/max, a difficult real-world scheduling problem.

Chapter 1
Introduction 


This  thesis  is  about  solving  NP-hard  computational  problems  more  efficiently  in  practice. 

Although  conjectured  to  be  worst-case  intractable,  NP-hard  problems  arise  frequently 
in  the  real  world.  Solving  them  efficiently  is  a  central  concern  in  fields  such  as  operations 
research,  computational  biology,  artificial  intelligence,  and  formal  verification. 

Looking  over  the  past  few  decades  of  computer  science  research,  we  may  distinguish 
several  high-level  approaches  to  dealing  with  NP-hard  problems: 

1. Problem-specific theoretical analysis. Instances of this approach include the
development of constant factor approximation algorithms for a wide variety of NP-hard
optimization problems [87], improved exponential-time algorithms [88], and
analyses of algorithms for random and semi-random problems [25].

2.  Problem-specific  engineering.  Examples  of  this  approach  include  the  ongoing  quest 
for  efficient  Boolean  satisfiability  solvers  [91],  and  algorithms  for  solving  specific 
operations  research  problems  such  as  job  shop  scheduling  [39]. 

3. Black-box optimization. A number of algorithms have been developed that aim to
solve a wide variety of optimization problems, given only black-box access to the
to-be-optimized function. Examples of such algorithms include the simulated annealing
algorithm [50], genetic algorithms [32], and genetic programming [53, 54].

Each  of  these  approaches  represents  an  active  area  of  research  unto  itself,  with  entire 
conferences  and  hundreds  of  papers  published  every  year. 

This thesis advances an approach that is different from, orthogonal to, and
complementary to each of the approaches just mentioned. At a high level, the goal of this thesis
is to improve the performance of existing heuristics for NP-hard problems by adapting the
heuristics to the problem instance(s) they are run on. In relationship to the three techniques
just discussed, the approach taken in this thesis lies at a level of abstraction somewhere in
between the problem-specific engineering approaches and the black-box approaches.

A  distinguishing  feature  of  this  work  is  that  the  adaptation  can  be  performed  on-the-fly, 
while  solving  a  sequence  of  problem  instances.  Our  online  algorithms  come  with  rigorous 
performance  guarantees,  stated  either  as  regret  bounds  or  as  a  competitive  ratio. 

In addition to proving theoretical guarantees, we evaluate our algorithms
experimentally using state-of-the-art solvers in a wide array of real-world problem domains. In many
cases, our algorithms are able to automatically produce new solvers that significantly
outperform the existing ones.


1.1  Summary 

This  thesis  is  organized  into  six  chapters. 

•  Chapter  1  is  the  introduction. 

•  Chapter  2,  “Online  Algorithms  for  Maximizing  Submodular  Functions”,  develops 
algorithms  for  solving  an  online  resource  allocation  problem  that  generalizes  several 
previously-studied  online  problems.  The  algorithms  developed  in  Chapter  2  form 
the  basis  for  many  of  the  experimental  and  theoretical  results  in  Chapter  3. 

• Chapter 3, “Combining Multiple Heuristics Online”, presents techniques for
combining multiple problem-solving algorithms into an improved algorithm by
interleaving the execution of the algorithms and, if the algorithms are randomized,
periodically restarting them with a fresh random seed. An important feature of the
work presented in this chapter is that a schedule for interleaving and restarting the
algorithms can be learned on-the-fly while solving a sequence of problems.

• Chapter 4, “Using Decision Procedures Efficiently for Optimization”, presents
techniques for improving the performance of algorithms that solve an optimization
problem by making a sequence of calls to an algorithm for the corresponding decision
problem.

• Chapter 5, “The Max k-Armed Bandit Problem”, studies a variant of the classical
multi-armed bandit problem in which the goal is to maximize the maximum payoff
received, rather than the sum of the payoffs. Algorithms for solving the max
k-armed bandit problem can be used to improve the performance of multi-start
heuristics, which obtain a solution to an optimization problem by performing a number of
independent runs of a randomized heuristic and returning the best solution obtained.

•  Chapter  6  is  the  conclusion. 

In  the  subsections  that  follow  we  formally  define  the  problems  considered  in  chapters 
2  through  5,  discuss  the  motivation  for  studying  each  problem,  and  summarize  the  main 
theoretical  and  experimental  results.  Some  of  the  text  in  these  subsections  is  duplicated  in 
the  introductory  sections  of  the  corresponding  chapters. 

The  results  in  this  thesis  are  based  in  part  on  five  conference  papers  [78,  79,  80,  82,  83] 
and  a  working  paper  [76]. 

1.1.1  Online  algorithms  for  maximizing  submodular  functions 

In this chapter we develop algorithms for solving a class of online resource allocation
problems, which can be described formally as follows. We are given as input a set $\mathcal{V}$ of
activities. A pair $(v, r) \in \mathcal{V} \times \mathbb{R}_{>0}$ is called an action, and specifies that time $r$ is to be
invested in activity $v$. A schedule is a sequence of actions. We denote by $\mathcal{S}$ the set of
all schedules. A job is a function $f : \mathcal{S} \to [0, 1]$, where for any $S \in \mathcal{S}$, $f(S)$ equals
the proportion of some task that is accomplished after performing the sequence of actions
$S$. We require that a job $f$ satisfy the following conditions (here $\oplus$ is the concatenation
operator):

1. (monotonicity) for any schedules $S_1, S_2 \in \mathcal{S}$, we have $f(S_1) \le f(S_1 \oplus S_2)$ and
$f(S_2) \le f(S_1 \oplus S_2)$.

2. (submodularity) for any schedules $S_1, S_2 \in \mathcal{S}$ and any action $a \in \mathcal{V} \times \mathbb{R}_{>0}$,

$$f(S_1 \oplus S_2 \oplus \langle a \rangle) - f(S_1 \oplus S_2) \le f(S_1 \oplus \langle a \rangle) - f(S_1). \tag{1.1}$$

We will evaluate schedules in terms of two objectives. The first objective is to minimize

$$c(f, S) = \int_{t=0}^{\infty} \left(1 - f\left(S_{\langle t \rangle}\right)\right) dt \tag{1.2}$$

where $S_{\langle t \rangle}$ is the schedule that results from truncating schedule $S$ at time $t$. For example,
if $S = \langle (h_1, 3), (h_2, 3) \rangle$ then $S_{\langle 5 \rangle} = \langle (h_1, 3), (h_2, 2) \rangle$. We refer to $c(f, S)$ as the cost of $S$.
The second objective is to maximize $f(S_{\langle T \rangle})$ for some fixed $T > 0$. We refer to $f(S_{\langle T \rangle})$
as the coverage of $S$ at time $T$.
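As a concrete illustration of truncation, cost, and coverage, the following Python sketch may help. The function names, the representation of a schedule as a list of (activity, time) pairs, and the fixed-step numerical integration up to a finite horizon are assumptions made for illustration only; they are not taken from the thesis.

```python
from typing import Callable, List, Tuple

Action = Tuple[str, float]          # (activity, time invested)
Schedule = List[Action]
Job = Callable[[Schedule], float]   # maps a schedule to a value in [0, 1]


def truncate(schedule: Schedule, t: float) -> Schedule:
    """Return the schedule cut off after a total of t time units."""
    out: Schedule = []
    remaining = t
    for activity, duration in schedule:
        if remaining <= 0:
            break
        used = min(duration, remaining)
        out.append((activity, used))
        remaining -= used
    return out


def coverage(f: Job, schedule: Schedule, T: float) -> float:
    """Value of the job on the schedule truncated at time T."""
    return f(truncate(schedule, T))


def cost(f: Job, schedule: Schedule, horizon: float, step: float = 1e-3) -> float:
    """Midpoint-rule approximation of the integral of 1 - f(truncated schedule).

    The integral in the text runs to infinity; this sketch stops at `horizon`,
    which is only adequate if the job is (nearly) complete by then.
    """
    n = int(round(horizon / step))
    total = 0.0
    for i in range(n):
        t = (i + 0.5) * step
        total += (1.0 - f(truncate(schedule, t))) * step
    return total
```

With these definitions, `truncate([("h1", 3), ("h2", 3)], 5)` returns `[("h1", 3), ("h2", 2)]`, which matches the truncation example in the text.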

In the online setting, an arbitrary sequence $\langle f_1, f_2, \ldots, f_n \rangle$ of jobs arrives one at a time,
and we must finish each job (via some schedule) before moving on to the next job. When
selecting a schedule $S_i$ to use to finish job $f_i$, we have knowledge of the previous jobs
$f_1, f_2, \ldots, f_{i-1}$ but we have no knowledge of $f_i$ itself or of any subsequent jobs. In this
setting we develop schedule-selection strategies that minimize regret, which is a measure
of the difference between the average cost (or average coverage) of the schedules produced
by our online algorithm and that of the best single schedule (in hindsight) for the given
sequence of jobs.

To understand the rationale for studying these two problems, consider the following
example. Let each activity $v$ represent a randomized algorithm for solving some decision
problem, and let the action $(v, r)$ represent running the algorithm (with a fresh random
seed) for time $r$. Fix some particular instance of the decision problem, and for any
schedule $S$, let $f(S)$ be the probability that one (or more) of the runs in the sequence $S$ yields
a solution to the instance. We show in §2.1.4 that $f$ satisfies the conditions required of a
job. Then $f(S_{\langle T \rangle})$ is (by definition) the probability that performing the runs in schedule
$S$ yields a solution to the problem instance in time at most $T$. For any non-negative random
variable $X$, we have $E[X] = \int_{t=0}^{\infty} P[X > t] \, dt$. Thus $c(f, S)$ is the expected time that
elapses before a solution is obtained.
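The job in this example can be sketched in a few lines of Python. The sketch assumes that runs are statistically independent and that each algorithm is described by a run length distribution function giving the probability that a single run finishes within a given time; the names `make_restart_job`, `marginal_gain`, and `cdf` are hypothetical. Under those assumptions the probability that at least one run succeeds is one minus the product of the individual failure probabilities.

```python
import math
from typing import Callable, Dict, List, Tuple

Schedule = List[Tuple[str, float]]   # list of (algorithm name, run time)


def make_restart_job(cdf: Dict[str, Callable[[float], float]]) -> Callable[[Schedule], float]:
    """Build f(S) = probability that at least one independent run in S succeeds.

    cdf[v](r) is the (assumed known) probability that one run of algorithm v
    finds a solution within time r.
    """
    def f(schedule: Schedule) -> float:
        p_all_fail = 1.0
        for v, r in schedule:
            p_all_fail *= 1.0 - cdf[v](r)
        return 1.0 - p_all_fail
    return f


def marginal_gain(f: Callable[[Schedule], float], schedule: Schedule,
                  action: Tuple[str, float]) -> float:
    """Increase in f obtained by appending one more action to the schedule."""
    return f(schedule + [action]) - f(schedule)
```

In this form the marginal gain of an extra run equals the probability that all earlier runs failed times the success probability of the new run, so it can only shrink as the schedule grows, which is the diminishing-returns behavior that condition (1.1) asks for.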

Under each of the two objectives just defined, the problem introduced in this chapter
generalizes a number of previously-studied problems. The problem of minimizing $c(f, S)$
generalizes Min-Sum Set Cover [26], Pipelined Set Cover [44, 64], the problem of
constructing efficient sequences of trials [22], the problem of constructing task-switching
schedules [73, 78], and the problem of constructing restart schedules [35, 61, 79]. The
problem of maximizing $f(S_{\langle T \rangle})$ for some fixed $T > 0$ generalizes the problem of
maximizing a monotone submodular set function subject to a knapsack constraint [56, 84], which in
turn generalizes Budgeted Maximum Coverage [49] and Max k-Coverage [65].
Prior to our work, many of these problems had only been considered in an offline
setting. For the problems that had been considered in an online setting, the online algorithms
presented in this chapter provide new and stronger guarantees.

We  now  summarize  the  main  technical  contributions  of  this  chapter. 

We first consider the problem of computing an optimal schedule in an offline setting,
given black-box access to the job $f$. As immediate corollaries of existing results [24, 26],
we obtain that for any $\epsilon > 0$, (i) achieving an approximation ratio of $4 - \epsilon$ for the problem of
minimizing $c(f, S)$ is NP-hard and (ii) achieving an approximation ratio of $1 - \frac{1}{e} + \epsilon$ for the
problem of maximizing $f(S_{\langle T \rangle})$ is NP-hard. Building on and generalizing previous work
[26, 84], we then present an offline greedy approximation algorithm that simultaneously
achieves the optimal approximation ratios (of 4 and $1 - \frac{1}{e}$, respectively) for each of these
two problems.
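To give a feel for what a greedy schedule builder might look like, here is a minimal Python sketch. It assumes one particular selection rule, namely repeatedly appending the action with the largest increase in the job value per unit of time invested, and it assumes a finite list of candidate actions and a fixed maximum schedule length. The text above does not spell out the rule, so these choices and the name `greedy_schedule` are assumptions of the sketch; the algorithm analyzed in the thesis is specified in Chapter 2.

```python
from typing import Callable, List, Sequence, Tuple

Action = Tuple[str, float]   # (activity, time invested)
Schedule = List[Action]


def greedy_schedule(f: Callable[[Schedule], float],
                    candidate_actions: Sequence[Action],
                    max_len: int,
                    tol: float = 1e-12) -> Schedule:
    """Append, step by step, the action with the best gain per unit time.

    Stops early when no candidate improves f by more than `tol` per unit time.
    """
    schedule: Schedule = []
    for _ in range(max_len):
        base = f(schedule)
        best, best_rate = None, tol
        for v, r in candidate_actions:
            rate = (f(schedule + [(v, r)]) - base) / r
            if rate > best_rate:
                best, best_rate = (v, r), rate
        if best is None:
            break
        schedule.append(best)
    return schedule
```

The job is accessed only through calls to `f`, which mirrors the black-box access assumed in the offline setting.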

We then consider the online setting. In this setting we provide an online algorithm
whose worst-case performance approaches that of the offline greedy algorithm
asymptotically (as the number of jobs approaches infinity). Assuming P ≠ NP, this guarantee is
essentially the best possible among online algorithms that make decisions in polynomial
time.

Our online algorithms can be used in several different feedback settings. We first
consider the feedback setting in which, after using schedule $S_i$ to complete job $f_i$, we
receive complete access to $f_i$. We then consider more limited feedback settings in which:
(i) to receive access to $f_i$ we must pay a price $C$, which is added to the regret, (ii) we only
observe $f_i\left((S_i)_{\langle t \rangle}\right)$ for each $t > 0$, and (iii) we only observe $f_i(S_i)$. These limited feedback
settings arise naturally in the applications discussed in the next chapter.


1.1.2  Combining  multiple  heuristics  online 

Many  important  computational  problems  are  NP-hard  and  thus  seem  unlikely  to  admit 
algorithms  with  provably  good  worst-case  performance,  yet  must  be  solved  as  a  matter 
of  practical  necessity.  For  many  of  these  problems,  heuristics  have  been  developed  that 
perform  much  better  in  practice  than  a  worst-case  analysis  would  guarantee.  Nevertheless, 
the  behavior  of  a  heuristic  on  a  previously  unseen  problem  instance  can  be  difficult  to 
predict  in  advance.  The  running  time  of  a  heuristic  may  vary  by  orders  of  magnitude  across 
seemingly  similar  problem  instances  or,  if  the  heuristic  is  randomized,  across  multiple  runs 
on  a  single  instance  that  use  different  random  seeds  [33,  38].  For  this  reason,  after  running 
a  heuristic  unsuccessfully  for  some  time  one  might  decide  to  suspend  the  execution  of 
that  heuristic  and  start  running  a  different  heuristic  (or  the  same  heuristic  with  a  different 
random  seed). 

In  this  chapter  we  consider  the  problem  of  allocating  CPU  time  to  various  heuristics 
so  as  to  minimize  the  time  required  to  solve  one  or  more  instances  of  a  decision  problem. 
We  consider  the  problem  of  selecting  an  appropriate  schedule  in  three  settings:  offline, 
learning-theoretic,  and  online.  The  results  in  this  chapter  significantly  generalize  and 
extend previous work on algorithm portfolios [33, 38, 68, 73, 90] and restart schedules
[31, 35, 61].

The problem considered in this chapter can be described formally as follows. We are
given as input a set $\mathcal{H}$ of (randomized) algorithms for solving some decision problem.
Given a problem instance, each $h \in \mathcal{H}$ is capable of returning a provably correct “yes” or
“no” answer to the problem, but the time required for a given $h \in \mathcal{H}$ to return an answer
depends both on the problem instance and on the random seed (and may be infinite). We
solve each problem instance by interleaving the execution of the heuristics according to
some schedule. Consistent with the framework of Chapter 2, we consider schedules of the
form

$$S = \langle (h_1, r_1), (h_2, r_2), \ldots \rangle$$

where each pair $(h_i, r_i)$ specifies that time $r_i$ is to be invested in heuristic $h_i$.

We allow each heuristic $h$ to be executed in one of two models. If $h$ is executed in
the restart model, then each action $(h, r)$ represents an independent run of $h$ with a fresh
random seed. If $h$ is executed in the suspend-and-resume model, then each action $(h, r)$
represents continuing a single run of $h$ for an additional $r$ time units.

This class of schedules includes both task-switching schedules [73] and restart
schedules [61] as special cases. A task-switching schedule is a schedule that executes all
heuristics in the suspend-and-resume model. A restart schedule is a schedule for a single
randomized heuristic (i.e., $|\mathcal{H}| = 1$), executed in the restart model.


Motivations 

To appreciate the power of task-switching schedules, consider Table 1.1, which shows the
behavior of the top two solvers from the industrial track of the 2007 SAT competition on
three of the competition benchmarks.


Table 1.1: Behavior of two solvers on instances from the 2007 SAT competition.

Instance                                             Rsat CPU (s)   picosat CPU (s)
industrial/anbulagan/medium-sat/dated-10-13-s.cnf    45             28
industrial/babic/dspam/dspam_dump_vc1081.cnf         3              > 10000
industrial/gricu/vmpc_31.cnf                         > 10000        238

On these benchmarks, interleaving the execution of the solvers according to an
appropriate schedule can dramatically improve average-case running time. Indeed, in this case
simply running the two solvers in parallel (e.g., at equal strength on a single processor)
would reduce the average-case running time by orders of magnitude.
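The effect of running the two solvers in parallel can be checked with a short calculation on the three instances of Table 1.1. The Python sketch below makes two simplifying assumptions: entries reported as "> 10000" are treated as exactly 10000 seconds (so the single-solver averages are only lower bounds), and two solvers sharing one processor at equal strength each run at half speed, so the instance is solved after twice the faster solver's time. The short instance labels and function names are illustrative.

```python
TIMEOUT = 10000.0  # "> 10000" entries are treated as 10000 s, a lower bound

# CPU seconds per instance, copied from Table 1.1.
cpu_seconds = {
    "dated-10-13-s": {"Rsat": 45.0, "picosat": 28.0},
    "dspam_dump": {"Rsat": 3.0, "picosat": TIMEOUT},
    "vmpc": {"Rsat": TIMEOUT, "picosat": 238.0},
}


def mean_single(solver: str) -> float:
    """Average CPU time when only one solver is used (a lower bound)."""
    return sum(row[solver] for row in cpu_seconds.values()) / len(cpu_seconds)


def mean_parallel() -> float:
    """Average CPU time when both solvers share one processor equally."""
    return sum(2.0 * min(row.values()) for row in cpu_seconds.values()) / len(cpu_seconds)


if __name__ == "__main__":
    print(mean_single("Rsat"), mean_single("picosat"), mean_parallel())
```

Under these assumptions each solver alone averages more than 3300 seconds, while the parallel combination averages under 180 seconds. Because the timed-out runs were cut off at 10000 seconds, the true gap is larger than this calculation shows.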

To appreciate the power of restart schedules, consider Figure 1.1, which depicts the run
length distribution of the SAT solver satz-rand on a Boolean formula derived from a
logistics planning benchmark. When run on this formula, satz-rand exhibits a
heavy-tailed run length distribution. There is about a 20% chance of solving the problem after
running for 2 seconds, but also a 20% chance that a run will not terminate after having run
for 1000 seconds. Restarting the solver every 2 seconds reduces the mean running time by
more than an order of magnitude.
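The order-of-magnitude claim follows from two simple bounds, sketched below in Python. The sketch assumes that restarted runs are independent, so the number of 2-second runs needed is geometrically distributed; the function names are hypothetical. With restarts, the mean time is at most the cutoff divided by the per-run success probability. Without restarts, the mean time is at least t times the probability that a run exceeds t.

```python
def mean_with_restarts_upper_bound(cutoff: float, p_success_by_cutoff: float) -> float:
    """Upper bound on mean time when restarting every `cutoff` seconds.

    The expected number of independent runs is 1 / p, and each run costs
    at most `cutoff` seconds.
    """
    return cutoff / p_success_by_cutoff


def mean_without_restarts_lower_bound(t: float, p_run_exceeds_t: float) -> float:
    """Lower bound on mean run length: E[X] >= t * P[X > t]."""
    return t * p_run_exceeds_t


if __name__ == "__main__":
    # Figures quoted in the text: 20% success within 2 s, 20% of runs exceed 1000 s.
    print(mean_with_restarts_upper_bound(2.0, 0.2))        # about 10 seconds
    print(mean_without_restarts_lower_bound(1000.0, 0.2))  # about 200 seconds
```

With the quoted figures, the restart schedule needs about 10 seconds on average at most, against at least about 200 seconds for a single uninterrupted run.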


[Figure: plot titled “satz-rand running on logistics.d (length 14)”; horizontal axis: time (s).]

Figure 1.1: Run length distribution of satz-rand on a formula derived from a logistics
planning benchmark.


Results 

We  now  summarize  the  main  technical  results  of  this  chapter.  As  already  mentioned, 
this  chapter  considers  the  schedule-selection  problem  in  three  settings:  offline,  learning- 
theoretic,  and  online. 

In the offline setting we are given as input the run length distribution of each $h \in \mathcal{H}$ for
each problem instance in a set of instances, and wish to compute a schedule with minimum
average (expected) running time over the instances in the set. In this setting, the greedy
algorithm from Chapter 2 gives a 4-approximation to the optimal schedule and, for any $\epsilon > 0$,
computing a $(4 - \epsilon)$-approximation is NP-hard. We also give exact and approximation
algorithms based on shortest paths that are able to compute an $\alpha$-approximation to the
optimal schedule for any $\alpha > 1$, but whose running time is exponential as a function of
$|\mathcal{H}|$.

In  the  learning-theoretic  setting,  we  draw  training  instances  from  a  fixed  distribution, 
compute  an  (approximately)  optimal  schedule  for  the  training  instances,  and  then  use  that 
schedule  to  solve  additional  test  instances  drawn  from  the  same  distribution.  In  this  setting, 
we  give  bounds  on  the  number  of  training  instances  required  to  learn  a  schedule  that  is 
probably  approximately  correct. 

In  the  online  setting  we  are  fed  a  sequence  of  problem  instances  one  at  a  time  and 
must  obtain  a  solution  to  each  instance  before  moving  on  to  the  next.  In  this  setting  we 
show that the online greedy algorithm from Chapter 2 converges to a 4-approximation
to the best fixed schedule for the instance sequence, and requires decision-making time
polynomial in $|\mathcal{H}|$. We also present online shortest paths algorithms that, for any $\alpha > 1$,
can be guaranteed to converge to an $\alpha$-approximation to the best fixed schedule, but these
online algorithms require decision-making time exponential in $|\mathcal{H}|$.

Our results in each of these three settings can be extended in two ways. First, our
algorithms can be applied in an interesting way to heuristics for optimization rather than
decision problems. Second, quickly-computable features of problem instances can be
exploited in a principled way to improve the schedule selection process.

This  chapter  concludes  with  an  experimental  evaluation  of  the  techniques  developed 
for  both  the  offline  and  online  settings.  The  main  results  of  our  experimental  evaluation 
can  be  summarized  as  follows. 

1. Using data from recent solver competitions, we show that schedules computed by
our algorithms can be used to improve the performance of state-of-the-art solvers in
several problem domains, including Boolean satisfiability, A.I. planning, constraint
satisfaction, and theorem proving.

2. We apply our algorithms to optimization problems (as opposed to decision
problems), and demonstrate that they can be used to improve the performance of
state-of-the-art algorithms for pseudo-Boolean optimization (also known as zero-one integer
programming).

3. We show that additional performance improvements can be obtained by using
instance-specific features to tailor the choice of schedule to a particular problem instance.

4. We use our offline algorithms to construct a restart schedule for the SAT solver
satz-rand that improves its performance on an ensemble of problem instances
derived from logistics planning benchmarks.


1.1.3  Using  decision  procedures  efficiently  for  optimization 

Optimization problems are often solved by making repeated calls to a decision procedure
that answers questions of the form “Does there exist a solution with cost at most k?” Each
query to the decision procedure can be represented as a pair (k, t), where t is a bound on
the CPU time the decision procedure may consume in answering the question. The result
of a query is either a (provably correct) “yes” or “no” answer or a timeout. A query strategy
is a rule for determining the next query (k, t) as a function of the responses to previous
queries.


[Figure: bar chart of CPU time (seconds) against makespan bound (k), with axis ticks at k = 6, 11, 16, 21, 26; bars are marked “no”, “yes”, or “timeout”, and the top of the vertical axis is labeled “>100 hours”.]

Figure 1.2: Behavior of the SAT solver siege running on formulae generated by
SATPLAN to solve instance p17 from the pathways domain of the 2006 International
Planning Competition.

One optimization algorithm of this form is SATPLAN, a state-of-the-art algorithm for
classical planning. SATPLAN finds a minimum-length plan by making a series of calls to
a SAT solver, where each call determines whether there exists a feasible plan of makespan
$\le k$ (where the value of $k$ varies across calls). The original version of SATPLAN uses the
ramp-up query strategy, which simply executes the queries $(1, \infty), (2, \infty), (3, \infty), \ldots$ in
sequence (stopping as soon as a “yes” answer is obtained).
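The ramp-up strategy is simple enough to state in a few lines of Python. In the sketch below, `decide` stands for a hypothetical decision procedure that is run without a time limit and returns `True` for “yes” and `False` for “no”; the optional `k_max` guard is an addition for illustration so that the loop can stop when no solution exists.

```python
from typing import Callable, Optional


def ramp_up(decide: Callable[[int], bool],
            k_start: int = 1,
            k_max: Optional[int] = None) -> Optional[int]:
    """Query k = k_start, k_start + 1, ... until the first "yes" answer.

    Returns the smallest k for which decide(k) is True, i.e. the optimal
    cost, or None if k_max is reached without a "yes".
    """
    k = k_start
    while k_max is None or k <= k_max:
        if decide(k):      # each call corresponds to the query (k, infinity)
            return k
        k += 1
    return None
```

Every query below the optimum must run to completion before the next one starts, which is why a single expensive “no” answer can dominate the total running time.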

The motivation for the work in this chapter is that the choice of query strategy often
has a dramatic effect on the time required to obtain a (provably) approximately optimal
solution. As an example, consider Figure 1.2, which shows the CPU time required by the
query $(k, \infty)$ as a function of $k$, on a particular planning benchmark instance. On this
instance, using the ramp-up query strategy requires one to invest over 100 hours of CPU
time before obtaining a feasible plan. On the other hand, executing the queries $(18, \infty)$
and $(23, \infty)$ takes less than two minutes and yields a plan whose makespan is provably at
most $\approx 1.21$ times optimal.
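The factor of roughly 1.21 can be recovered by a short calculation. The derivation below assumes that makespans are integers, that the query with k = 18 returned “no”, and that the query with k = 23 returned “yes”. The text does not state these outcomes explicitly, but they are consistent with the quoted ratio.

```latex
\mathrm{OPT} \ge 19
\quad\text{and}\quad
\text{makespan of the plan found} \le 23
\quad\Longrightarrow\quad
\frac{\text{makespan of the plan found}}{\mathrm{OPT}} \le \frac{23}{19} \approx 1.21 .
```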

This chapter presents both a theoretical and an experimental study of query strategies.
We consider the problem of devising query strategies in two settings. In the single-instance
setting, we are confronted with a single optimization problem, and wish to obtain an
(approximately) optimal solution as quickly as possible. In the multiple-instance setting, we
use the same decision procedure to solve a number of optimization problems, and our goal
is to learn from experience in order to improve performance.

In the single-instance setting, we are interested in minimizing the CPU time required
to obtain a given upper or lower bound on OPT, where OPT is the minimum cost of any
solution. Fix a problem instance, and let $r(k)$ denote the CPU time required by the
decision procedure when run on input $k$. We define the competitive ratio of a query strategy
(on that instance) as the maximum, over all $k$, of the time required by the query strategy
to determine what side of OPT that $k$ is on (either by obtaining a “yes” answer for some
$k' \le k$, or by obtaining a “no” answer for some $k' \ge k$), divided by $r(k)$. We analyze
query strategies in terms of their competitive ratio on the worst-case instance within some
well-defined class of instances.

The competitive ratio of our query strategies will depend on the behavior of the
function $r$ (as just mentioned, $r(k)$ is the CPU time required by the decision procedure when
run on input $k$). For most decision procedures used in practice, we expect $r(k)$ to be an
increasing function for $k < \mathrm{OPT}$ and a decreasing function for $k > \mathrm{OPT}$ (e.g., see Figure
1.2), and our query strategies are designed to take advantage of this behavior. More
specifically, our query strategies are designed to work well when $r$ is close to its hull, which is
the function

$$\mathrm{hull}(k) = \min\left\{ \max_{k_0 \le k} r(k_0),\ \max_{k_1 \ge k} r(k_1) \right\}.$$

Figure 1.3 gives an example of a function $r$ (gray bars) and its hull (dots). Note that
$r$ and hull are identical if $r$ is monotonically increasing (or monotonically decreasing), or
if there exists a $K$ such that $r$ is monotonically increasing for $k < K$ and monotonically
decreasing for $k > K$.

Figure 1.3: A function $r$ (gray bars) and its hull (dots).


We measure the discrepancy between $r$ and its hull in terms of the quantity

$$\Delta = \max_{k} \frac{\mathrm{hull}(k)}{r(k)},$$

which we refer to as the stretch of $r$. The instance depicted in Figure 1.3 has a stretch of 2
because $r(2) = 1$ while $\mathrm{hull}(2) = 2$.
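The hull and the stretch are straightforward to compute from a table of running times. The Python sketch below uses prefix and suffix maxima; the function names are illustrative, and the running times in the usage note are made-up values chosen so that the time at k = 2 is 1 and the hull at k = 2 is 2. They are not the data behind the figure.

```python
from typing import Dict


def hull(r: Dict[int, float]) -> Dict[int, float]:
    """hull(k) = min(max of r over k0 <= k, max of r over k1 >= k)."""
    ks = sorted(r)
    prefix_max: Dict[int, float] = {}
    running = float("-inf")
    for k in ks:
        running = max(running, r[k])
        prefix_max[k] = running
    suffix_max: Dict[int, float] = {}
    running = float("-inf")
    for k in reversed(ks):
        running = max(running, r[k])
        suffix_max[k] = running
    return {k: min(prefix_max[k], suffix_max[k]) for k in ks}


def stretch(r: Dict[int, float]) -> float:
    """Largest ratio hull(k) / r(k) over all k (times must be positive)."""
    h = hull(r)
    return max(h[k] / r[k] for k in r)
```

For example, with `r = {1: 2.0, 2: 1.0, 3: 4.0, 4: 3.0}` the hull is `{1: 2.0, 2: 2.0, 3: 4.0, 4: 3.0}` and the stretch is 2.0. For any running-time table that rises to a single peak and then falls, the hull coincides with the table and the stretch is 1.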

In the single-instance setting, our main result is a query strategy $S_2$ whose worst-case
competitive ratio is $O(\Delta \log U)$, where $U$ is the difference between the initial upper and
lower bounds on OPT. $S_2$ makes use of a form of guessing-and-doubling in combination
with a two-sided binary search. We prove a matching lower bound, showing that any query
strategy has a competitive ratio $\Omega(\Delta \log U)$ on some instance. We also show that, in the
absence of any assumptions about $\Delta$, a trivial query strategy $S_1$ based on
guessing-and-doubling obtains a worst-case competitive ratio that is $O(U)$, and we prove a matching
$\Omega(U)$ lower bound.

In the multiple-instance setting, we prove that computing an optimal query strategy is
NP-hard, and discuss how algorithms from machine learning theory can be used to learn
an appropriate query strategy on the fly while solving a sequence of problems.

In the experimental section of this chapter, we use the query strategy S2 to create a
modified version of SATPLAN that finds (provably) approximately optimal plans more
quickly than the original version of SATPLAN (which uses the ramp-up query strategy).
We also create a modified version of a branch and bound algorithm for job shop scheduling
that yields improved upper and lower bounds relative to the original algorithm. In the
course of the latter experiments we develop a simple method for applying query strategies
to branch and bound algorithms, which seems likely to be useful in other domains besides
job shop scheduling.


1.1.4 The max k-armed bandit problem

The max k-armed bandit problem [19, 21] can be described as follows. Imagine that you
find yourself in the following unusual casino. The casino contains k slot machines. Each
machine has an arm that, when pulled, yields a payoff drawn from a fixed (but unknown)
distribution. You are given n tokens to use in playing the machines, and you may decide
how to spend these tokens adaptively based on the payoffs you receive from playing the
various machines. The catch is that, when you leave the casino, you only get to keep the
maximum of the payoffs you received on any individual pull. The max k-armed bandit
problem differs from the well-studied classical k-armed bandit problem in that one seeks
to optimize the maximum payoff received, rather than the sum of the payoffs.

Our  motivation  for  studying  this  problem  is  to  boost  the  performance  of  multi-start 
heuristics,  which  obtain  a  solution  to  an  optimization  problem  by  performing  a  number 
of  independent  runs  of  a  randomized  heuristic  and  returning  the  best  solution  obtained. 
Despite  their  simplicity,  multi-start  heuristics  are  used  widely  in  practice,  and  represent 
the state of the art in a number of domains [14, 20, 27]. A max k-armed bandit strategy
can  be  used  to  distribute  trials  among  different  multi-start  heuristics  or  among  different 
parameter  settings  for  the  same  multi-start  heuristic.  Previous  work  has  demonstrated  the 
effectiveness  of  such  an  approach  on  the  RCPSP/max,  a  difficult  real-world  scheduling 
problem  [19,  21,  82]. 

In this chapter our goal is to develop strategies for the max k-armed bandit problem
that minimize regret, which we define to be the difference between the (expected)
maximum payoff our strategy receives and that of the best pure strategy, where a pure strategy is
one that plays the same arm every time. It is not difficult to show that regret-minimization
is hopeless in the absence of any assumptions about the payoff distributions. As a simple
example, imagine that all payoffs are either 0 or 1, that k − 1 of the arms always return a
payoff of 0, and that one randomly-selected “good” arm returns a payoff of 1 with
probability 1/n. In this case, we show that one cannot obtain an expected maximum payoff larger
than 1/k after n pulls, whereas the pure strategy that invests all n pulls on the “good” arm
obtains expected maximum payoff 1 − (1 − 1/n)^n ≈ 1 − 1/e.
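
The gap in this example can be checked numerically. The sketch below assumes the “good” arm pays 1 with probability 1/n on each pull, and compares the pure strategy against a strategy that, having no information, picks an arm uniformly at random on every pull. The function names are illustrative and the uniform strategy is only one example of an uninformed strategy.

```python
import math


def pure_strategy_payoff(n):
    """Expected max payoff from n pulls of an arm paying 1 with probability 1/n."""
    return 1.0 - (1.0 - 1.0 / n) ** n


def uniform_strategy_payoff(n, k):
    """Expected max payoff when each of n pulls picks one of k arms uniformly."""
    return 1.0 - (1.0 - 1.0 / (k * n)) ** n


for n in (10, 100, 1000):
    print(n, pure_strategy_payoff(n), uniform_strategy_payoff(n, k=10))
print(1.0 - 1.0 / math.e)
```

For k = 10 the uniform strategy stays below 1/k = 0.1 for every n, while the pure strategy stays above 1 − 1/e.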

We present two strategies for solving the max k-armed bandit problem. The first
strategy, Threshold Ascent, is designed to work well when the payoff distributions have certain
characteristics which we expect to be present in cases of practical interest. Roughly
speaking, Threshold Ascent will work best when the following two criteria are satisfied.


1. There is a (relatively low) threshold t_critical such that, for all t > t_critical, the arm
that is most likely to yield a payoff > t is the same as the arm most likely to yield a
payoff > t_critical. Call this arm i*.

2. As t increases beyond t_critical, there is a growing gap between the probability that
arm i* yields a payoff > t and the corresponding probability for other arms.
Specifically, if we let p_i(t) denote the probability that the ith arm returns a payoff > t, the
ratio p_i*(t) / p_i(t) should increase as a function of t for t > t_critical, for any i ≠ i*.

Figure  1.4  illustrates  a  set  of  two  payoff  distributions  that  satisfy  these  assumptions. 


Figure 1.4: A max k-armed bandit instance on which Threshold Ascent should perform
well.

The idea of Threshold Ascent is very simple. Threshold Ascent attempts to maximize
the number of payoffs > T that it receives, where T is a threshold that is gradually
increased over time. For any fixed T, this goal is accomplished by mapping payoffs > T
to 1 and mapping payoffs ≤ T to zero, then treating the problem as an instance of the
classical k-armed bandit problem (where the goal is to maximize the sum of the payoffs
received).

As T increases, non-zero payoffs become increasingly rare, and thus we would like to
have an algorithm for solving the classical k-armed bandit problem that works well when
the mean payoff of each arm is very small. Toward this end, we design and analyze a new
algorithm for the classical k-armed bandit problem called Chernoff Interval Estimation,
which yields improved regret bounds when each arm has a small mean payoff.
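
The reduction described above can be sketched in a few lines of Python. This is an illustration of the idea only, not the algorithm analyzed in this thesis. Two pieces are assumptions introduced here: the textbook UCB1 index stands in for Chernoff Interval Estimation (which is not reproduced), and the threshold is raised by a hypothetical rule that keeps fewer than `s` observed payoffs above it.

```python
import math
import random


def ucb1_index(successes, pulls, t):
    """Stand-in index policy for the classical (sum-of-payoffs) bandit."""
    if pulls == 0:
        return float("inf")              # try every arm at least once
    return successes / pulls + math.sqrt(2.0 * math.log(t) / pulls)


def threshold_ascent_sketch(arms, n_pulls, s=10):
    """arms: list of zero-argument callables, each returning one payoff.

    Returns (maximum payoff seen, number of pulls given to each arm).
    """
    history = [[] for _ in arms]         # payoffs observed per arm
    threshold = float("-inf")
    for t in range(1, n_pulls + 1):
        # Hypothetical ascent rule: keep fewer than s observed payoffs above T.
        seen = sorted((x for h in history for x in h), reverse=True)
        if len(seen) >= s:
            threshold = seen[s - 1]
        # Map payoffs > T to 1 and all others to 0, then score each arm.
        scores = [ucb1_index(sum(x > threshold for x in h), len(h), t)
                  for h in history]
        i = max(range(len(arms)), key=scores.__getitem__)
        history[i].append(arms[i]())
    best = max(x for h in history for x in h)
    return best, [len(h) for h in history]


rng = random.Random(0)
arms = [lambda: rng.uniform(0.0, 1.0), lambda: rng.uniform(0.0, 2.0)]
best, pulls = threshold_ascent_sketch(arms, n_pulls=200)
print(best, pulls)
```

Because the threshold is the s-th largest payoff seen so far, it never decreases, which matches the requirement that T be increased gradually over time.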

In the experimental section of this chapter, we demonstrate the effectiveness of
Threshold Ascent by using it to select among multi-start heuristics for the RCPSP/max, a
difficult real-world scheduling problem. We find that Threshold Ascent (i) performs better
than any of the multi-start heuristics performs in isolation, and (ii) outperforms the recent
QD-BEACON max k-armed bandit algorithm of Cicirello and Smith [19, 21].

Following the lead of Cicirello and Smith [19, 21], we also consider the special case
where each payoff distribution is a generalized extreme value (GEV) distribution. The
motivation for studying this special case is the Extremal Types Theorem [23], which singles
out the GEV as the limiting distribution of the maximum of a large number of
independent identically distributed (i.i.d.) random variables. Roughly speaking, one can think of
the Extremal Types Theorem as an analogue of the Central Limit Theorem. Just as the
Central Limit Theorem states that the average of a large number of i.i.d. random variables
converges in distribution to a Gaussian, the Extremal Types Theorem states that the
maximum of a large number of i.i.d. random variables converges in distribution to a GEV. We
provide a no-regret strategy for this special case, generalizing and improving upon earlier
theoretical work by Cicirello and Smith [19, 21].
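
One concrete instance of this convergence can be checked without simulation. For n i.i.d. exponential random variables with unit rate (a textbook case chosen here for illustration, not an example from this thesis), the maximum shifted by ln n has an exact distribution function that approaches the Gumbel distribution, which is the GEV with shape parameter zero. The sketch below compares the two.

```python
import math


def cdf_centered_max(x, n):
    """P(max of n i.i.d. Exp(1) variables - ln(n) <= x), valid for x >= -ln(n)."""
    return (1.0 - math.exp(-x) / n) ** n


def gumbel_cdf(x):
    """Distribution function of the standard Gumbel law (GEV with shape 0)."""
    return math.exp(-math.exp(-x))


for n in (10, 1000, 10**6):
    gap = max(abs(cdf_centered_max(x, n) - gumbel_cdf(x)) for x in (-1, 0, 1, 2))
    print(n, gap)
```

The printed gap shrinks as n grows, which is the convergence in distribution that the theorem describes for this particular family.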
Chapter 2

Online  Algorithms  for  Maximizing 
Submodular  Functions 

2.1  Introduction 

In this chapter we present algorithms for solving a specific class of online resource
allocation problems. Our online algorithms can be applied in environments where abstract jobs
arrive one at a time, and one can complete the jobs by investing time in a number of
abstract activities. Provided that the jobs and activities satisfy certain technical conditions,
our online algorithm is guaranteed to perform almost as well as any fixed schedule for
investing time in the various activities, according to two natural measures of performance.
As we discuss further in §2.1.5, our problem formulation captures a number of
previously-studied problems, including selection of algorithm portfolios [33, 38], selection of restart
schedules [35, 61], and database query optimization [9, 64]. Additionally, this online
algorithm forms the basis for many of the theoretical and experimental results in Chapter 3,
“Combining Multiple Heuristics Online”.


2.1.1  Formal  setup 

The problem considered in this chapter can be defined as follows. We are given as input
a finite set V of activities. A pair (v, r) ∈ V × R_{>0} is called an action, and represents
spending time r performing activity v. A schedule is a sequence of actions. We use 𝒮 to
denote the set of all schedules. A job is a function f : 𝒮 → [0, 1], where for any schedule
S ∈ 𝒮, f(S) represents the proportion of some task that is accomplished by performing
the sequence of actions S. We require that a job f satisfy the following conditions (here
⊕ is the concatenation operator and ⟨a⟩ is the schedule consisting of the single action a):

1. (monotonicity) for any schedules S_1, S_2 ∈ 𝒮, we have f(S_1) ≤ f(S_1 ⊕ S_2) and
f(S_2) ≤ f(S_1 ⊕ S_2).

2. (submodularity) for any schedules S_1, S_2 ∈ 𝒮 and any action a ∈ V × R_{>0},

\[ f(S_1 \oplus S_2 \oplus \langle a \rangle) - f(S_1 \oplus S_2) \le f(S_1 \oplus \langle a \rangle) - f(S_1). \tag{2.1} \]

We will evaluate schedules in terms of two objectives. The first objective is to
maximize f(S) subject to the constraint ℓ(S) ≤ T, for some fixed T > 0, where ℓ(S) equals
the sum of the durations of the actions in S. For example if S = ⟨(v_1, 3), (v_2, 3)⟩, then
ℓ(S) = 6. We refer to this problem as Budgeted Maximum Submodular Coverage
(the origin of this terminology is explained in §2.2).

The second objective is to minimize the cost of a schedule, which we define as

\[ c(f, S) = \int_{t=0}^{\infty} 1 - f(S_{\langle t \rangle}) \, dt \tag{2.2} \]

where S⟨t⟩ is the schedule that results from truncating schedule S at time t. For example
if S = ⟨(v_1, 3), (v_2, 3)⟩ then S⟨5⟩ = ⟨(v_1, 3), (v_2, 2)⟩.¹ One way to interpret this objective
is to imagine that f(S) is the probability that some desired event occurs as a result of
performing the actions in S. For any non-negative random variable X, we have
E[X] = ∫_{t=0}^{∞} P[X > t] dt. Thus c(f, S) is the expected time we must wait before the event occurs
if we execute actions according to the schedule S. We refer to the problem of computing
a schedule that minimizes c(f, S) as Min-Sum Submodular Cover.
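
The definitions of schedule length, truncation and cost can be made concrete with a short sketch. It assumes a schedule is stored as a Python list of (activity, time) pairs, and it estimates the cost integral with a midpoint rule over [0, ℓ(S)] only. The job `f` below is a hypothetical example (the task is finished once activity "v2" has been run for at least 2 time units in a single action); it is not taken from the text.

```python
def length(schedule):
    """Total duration of a schedule given as a list of (activity, time) pairs."""
    return sum(r for _, r in schedule)


def truncate(schedule, t):
    """Return the schedule that results from truncating `schedule` at time t."""
    out, remaining = [], t
    for v, r in schedule:
        if remaining <= 0:
            break
        out.append((v, min(r, remaining)))
        remaining -= r
    return out


def cost(f, schedule, dt=0.01):
    """Midpoint-rule estimate of the integral of 1 - f(truncated S) on [0, length].

    This matches the cost in (2.2) only when f(schedule) == 1; otherwise the
    integral over [0, infinity) is infinite for a finite schedule.
    """
    steps = int(round(length(schedule) / dt))
    return sum((1.0 - f(truncate(schedule, (j + 0.5) * dt))) * dt
               for j in range(steps))


def f(schedule):
    """Hypothetical job: done once "v2" runs for >= 2 time units in one action."""
    return 1.0 if any(v == "v2" and r >= 2 for v, r in schedule) else 0.0


S = [("v1", 3), ("v2", 3)]
print(length(S))       # 6
print(truncate(S, 5))  # [('v1', 3), ('v2', 2)]
print(cost(f, S))      # approximately 5
```

Here the task completes at time 5 (three units on "v1", then two on "v2"), so the cost is the waiting time 5, in line with the expected-waiting-time interpretation above.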

In the online setting, an arbitrary sequence ⟨f_1, f_2, ..., f_n⟩ of jobs arrive one at a time,
and we must finish each job (via some schedule) before moving on to the next job. When
selecting a schedule S_i to use to finish job f_i, we have knowledge of the previous jobs
f_1, f_2, ..., f_{i−1} but we have no knowledge of f_i itself or of any subsequent jobs. In this
setting our goal is to develop schedule-selection strategies that minimize regret, which is a
measure of the difference between the average cost (or average coverage) of the schedules
produced by our online algorithm and that of the best single schedule (in hindsight) for the
given sequence of jobs.

The  following  example  illustrates  these  definitions. 

¹ More formally, if S = ⟨a_1, a_2, ...⟩, where a_i = (v_i, r_i), then S⟨t⟩ = ⟨a_1, a_2, ..., a_{k−1}, a_k, (v_{k+1}, r′)⟩,
where k is the largest integer such that Σ_{i=1}^{k} r_i < t and r′ = t − Σ_{i=1}^{k} r_i.
Example 1. Let each activity v represent a randomized algorithm for solving some
decision problem, and let the action (v, r) represent running the algorithm (with a fresh random
seed) for time r. Fix some particular instance of the decision problem, and for any
schedule S, let f(S) be the probability that one (or more) of the runs in the sequence S yields a
solution to that instance. So f(S⟨T⟩) is (by definition) the probability that performing the
runs in schedule S yields a solution to the problem instance in time ≤ T, while c(f, S) is
the expected time that elapses before a solution is obtained. It is clear that f satisfies
the monotonicity condition required of a job, because adding runs to the sequence S can
only increase the probability that one of the runs is successful. The fact that f is
submodular can be seen as follows. For any schedule S and action a, f(S ⊕ ⟨a⟩) − f(S) equals
the probability that action a succeeds after every action in S has failed, which can also be
written as (1 − f(S)) · f(⟨a⟩). This, together with the monotonicity of f, implies that for
any schedules S_1, S_2 and any action a, we have

\[
\begin{aligned}
f(S_1 \oplus S_2 \oplus \langle a \rangle) - f(S_1 \oplus S_2) &= (1 - f(S_1 \oplus S_2)) \cdot f(\langle a \rangle) \\
&\le (1 - f(S_1)) \cdot f(\langle a \rangle) \\
&= f(S_1 \oplus \langle a \rangle) - f(S_1)
\end{aligned}
\]

so f is submodular.
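
The argument in Example 1 can be spot-checked numerically. The sketch below builds a job from a per-run success probability and tests monotonicity and the submodularity inequality over all pairs of short schedules. The exponential run-length model, the rates and the action set are hypothetical choices made here for illustration.

```python
import itertools
import math


def make_job(success_prob):
    """Build f(S) = probability that at least one independent run in S succeeds."""
    def f(schedule):
        all_fail = 1.0
        for v, r in schedule:
            all_fail *= 1.0 - success_prob(v, r)
        return 1.0 - all_fail
    return f


# Hypothetical model: algorithm v succeeds within time r
# with probability 1 - exp(-rate[v] * r).
rate = {"A": 0.5, "B": 0.2}
f = make_job(lambda v, r: 1.0 - math.exp(-rate[v] * r))

actions = [("A", 1.0), ("A", 2.0), ("B", 1.0)]
schedules = [list(s) for n in range(3)
             for s in itertools.product(actions, repeat=n)]


def is_job_on(schedules, actions, f, tol=1e-12):
    """Check monotonicity and submodularity on the given finite collection."""
    for s1, s2 in itertools.product(schedules, repeat=2):
        if f(s1) > f(s1 + s2) + tol or f(s2) > f(s1 + s2) + tol:
            return False                  # monotonicity violated
        for a in actions:
            lhs = f(s1 + s2 + [a]) - f(s1 + s2)
            rhs = f(s1 + [a]) - f(s1)
            if lhs > rhs + tol:
                return False              # submodularity violated
    return True


print(is_job_on(schedules, actions, f))   # True
```

A finite check of this kind is not a proof; it only confirms that the inequalities hold on the schedules that were enumerated.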


2.1.2  Sufficient  conditions 

In some cases of practical interest, f will not satisfy the submodularity condition but will
still satisfy weaker conditions that are sufficient for our results to carry through.

In the offline setting, our results will hold for any function f that satisfies the
monotonicity condition and, additionally, satisfies the following condition (we prove in §2.3 that
any submodular function satisfies this weaker condition).

Condition 1. For any S_1, S ∈ 𝒮,

\[ \frac{f(S_1 \oplus S) - f(S_1)}{\ell(S)} \le \max_{(v, r) \in V \times \mathbb{R}_{>0}} \left\{ \frac{f(S_1 \oplus \langle (v, r) \rangle) - f(S_1)}{r} \right\}. \]

Recall that ℓ(S) equals the sum of the durations of the actions in S. Informally,
Condition 1 says that the increase in f per unit time that results from performing a sequence
of actions S is always bounded by the maximum, over all actions (v, r), of the increase in
f per unit time that results from performing that action.

In the online setting, our results will apply if each function f_i in the sequence ⟨f_1,
f_2, ..., f_n⟩ satisfies the monotonicity condition and, additionally, the sequence as a whole
satisfies the following condition (we prove in §2.4 that if each f_i is a job, then this condition
is satisfied).

Condition 2. For any sequence S_1, S_2, ..., S_n of schedules and any schedule S,

\[ \frac{\sum_{i=1}^{n} \left( f_i(S_i \oplus S) - f_i(S_i) \right)}{\ell(S)} \le \max_{(v, r) \in V \times \mathbb{R}_{>0}} \left\{ \frac{\sum_{i=1}^{n} \left( f_i(S_i \oplus \langle (v, r) \rangle) - f_i(S_i) \right)}{r} \right\}. \]

As we discuss further in Chapter 3, this generality allows us to handle jobs similar to
the job defined in Example 1, but where an action (v, r) may represent continuing a run of
algorithm v for an additional r time units (rather than running v with a fresh random seed).
Note that the function f defined in Example 1 is no longer submodular when actions of
this form are allowed.


2.1.3  Summary  of  results 

We first consider the offline problems Budgeted Maximum Submodular Coverage
and Min-Sum Submodular Cover. As immediate consequences of existing
results [24, 26], we find that, for any ε > 0, (i) achieving an approximation ratio of 4 − ε for
Min-Sum Submodular Cover is NP-hard and (ii) achieving an approximation ratio
of 1 − 1/e + ε for Budgeted Maximum Submodular Coverage is NP-hard. We
then present a greedy approximation algorithm that simultaneously achieves the optimal
approximation ratio of 4 for Min-Sum Submodular Cover and the optimal
approximation ratio of 1 − 1/e for Budgeted Maximum Submodular Coverage, building
on and generalizing previous work on special cases of these two problems [26, 84].

The main contribution of this chapter, however, is to address the online setting. In
this setting we provide an online algorithm whose worst-case performance approaches
that of the offline greedy approximation algorithm asymptotically (as the number of jobs
approaches infinity). More specifically, we analyze the online algorithm’s performance
in terms of “α-regret”. For the cost-minimization objective, α-regret is defined as the
difference between the average cost of the schedules selected by the online algorithm and
α times the average cost of the optimal schedule for the given sequence of jobs. For the
coverage-maximization objective, α-regret is the difference between α times the average
coverage of the optimal fixed schedule and the average coverage of the schedules selected
by the online algorithm. For the objective of minimizing cost, the online algorithm’s
4-regret approaches zero as n → ∞, while for the objective of maximizing coverage, its
(1 − 1/e)-regret approaches zero as n → ∞. Assuming P ≠ NP, these guarantees are essentially the
best possible among online algorithms that make decisions in polynomial time.
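
The two α-regret definitions can be written out directly. The sketch below assumes the per-job costs (or coverages) of the online schedules and of the best fixed schedule have already been computed; the function names and the numbers in the usage lines are made up for illustration.

```python
import math


def alpha_regret_cost(online_costs, fixed_costs, alpha=4.0):
    """Average online cost minus alpha times the average cost of the fixed schedule.

    online_costs[i] is the cost of the schedule chosen for job i;
    fixed_costs[i] is the cost of the best fixed schedule on job i.
    """
    n = len(online_costs)
    return sum(online_costs) / n - alpha * sum(fixed_costs) / n


def alpha_regret_coverage(online_cov, fixed_cov, alpha=1.0 - 1.0 / math.e):
    """Alpha times the average fixed-schedule coverage minus average online coverage."""
    n = len(online_cov)
    return alpha * sum(fixed_cov) / n - sum(online_cov) / n


# Made-up numbers for two jobs.
print(alpha_regret_cost([10.0, 12.0], [3.0, 3.0]))     # -1.0
print(alpha_regret_coverage([0.5, 0.7], [0.9, 0.9]))
```

A negative value means the online schedules did better than the α-scaled benchmark on that sequence; the guarantees above only bound the regret from above.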
Our online algorithms can be used in several different feedback settings. We first
consider the feedback setting in which, after using schedule S_i to complete job f_i, we
receive complete access to f_i. We then consider more limited feedback settings in which:
(i) to receive access to f_i we must pay a price C, which is added to the regret, (ii) we only
observe f_i(S_i⟨t⟩) for each t > 0, and (iii) we only observe f_i(S_i).

We also prove tight information-theoretic lower bounds on 1-regret, and discuss
exponential time online algorithms whose regret matches the lower bounds to within
logarithmic factors. Interestingly, these lower bounds also match the upper bounds from our
online greedy approximation algorithm up to logarithmic factors, although the latter apply
to α-regret (for α = 4 or α = 1 − 1/e) rather than 1-regret.

The  results  in  this  chapter  are  based  on  a  working  paper  [76]. 


2.1.4  Problems  that  fit  into  this  framework 

We  now  discuss  how  a  number  of  previously-studied  problems  fit  into  the  framework  of 
this  chapter. 


Special  cases  of  Budgeted  Maximum  Submodular  Coverage 

The Budgeted Maximum Submodular Coverage problem introduced in this
chapter is a slight generalization of the problem of maximizing a monotone submodular set
function subject to a knapsack constraint [56, 84]. The only difference between the two
problems is that, in the latter problem, f(S) may only depend on the set of actions in the
sequence S, and not on the order in which the actions appear. The problem of maximizing
a monotone submodular set function subject to a knapsack constraint in turn generalizes
Budgeted Maximum Coverage [49], which generalizes Max k-Coverage [65].


Special  cases  of  Min-Sum  Submodular  Cover 

The  Min-Sum  Submodular  Cover  problem  introduced  in  this  chapter  generalizes 
several  previously-studied  problems,  including  Min-Sum  Set  Cover  [26],  Pipelined 
Set  Cover  [44,  64],  the  problem  of  constructing  efficient  sequences  of  trials  [22],  and 
the  problem  of  constructing  restart  schedules  [35,  61,  79].  Specifically,  these  problems 
can be represented in our framework by jobs of the form

\[ f(\langle (v_1, r_1), (v_2, r_2), \ldots, (v_L, r_L) \rangle) = \frac{1}{n} \sum_{i=1}^{n} \left( 1 - \prod_{l=1}^{L} \left( 1 - p_i(v_l, r_l) \right) \right). \tag{2.3} \]


This expression can be interpreted as follows: the job f consists of n subtasks, and p_i(v, r)
is the probability that investing time r in activity v completes the ith subtask. Thus, f(S) is
the expected fraction of subtasks that are finished after performing the sequence of actions
in S. Assuming p_i(v, r) is a non-decreasing function of r for all i and v, it can be shown
that any function f of this form satisfies the monotonicity and submodularity properties
required of a job. In the special case n = 1, this follows from Example 1. In the general
case n > 1, this follows from the fact (which follows immediately from the definitions)
that any convex combination of jobs is a job.

The problem of computing restart schedules places no further restrictions on p_i(v, r).
Pipelined Set Cover is the special case in which for each activity v there is an
associated time r_v, and p_i(v, r) = 1 if r ≥ r_v and p_i(v, r) = 0 otherwise. Min-Sum Set
Cover is the special case in which, additionally, r_v = 1 or r_v = ∞ for all v ∈ V. The
problem of constructing efficient sequences of trials corresponds to the case in which we
are given a matrix q, and p_i(v, r) = q_{v,i} if r ≥ 1 and p_i(v, r) = 0 otherwise.
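
A job of the form (2.3) is straightforward to evaluate. The sketch below implements the formula for a generic p_i(v, r) and instantiates it for the sequences-of-trials case, where a trial of activity v completes subtask i with probability q[v][i] once at least one time unit is invested. The matrix `q` and the helper names are hypothetical.

```python
def make_subtask_job(p, n):
    """Build the job in (2.3); p(i, v, r) is the chance that time r in v finishes subtask i."""
    def f(schedule):
        total = 0.0
        for i in range(n):
            all_fail = 1.0
            for v, r in schedule:
                all_fail *= 1.0 - p(i, v, r)
            total += 1.0 - all_fail
        return total / n
    return f


# Hypothetical trial matrix: q[v][i] is the chance that one trial of v finishes subtask i.
q = {"A": [0.5, 0.0], "B": [0.0, 1.0]}


def p_trials(i, v, r):
    """Sequences-of-trials case: a trial only counts if at least one time unit is spent."""
    return q[v][i] if r >= 1 else 0.0


f = make_subtask_job(p_trials, n=2)
print(f([("A", 1), ("B", 1)]))   # 0.75
print(f([("A", 1), ("A", 1)]))   # 0.375
print(f([("A", 0.5)]))           # 0.0
```

Setting every entry of `q` to 0 or 1 turns each activity into a set of subtasks that it covers with certainty.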


2.1.5  Applications 

We now discuss applications of the results presented in this chapter. The first
application, “Combining multiple heuristics online”, is evaluated experimentally in Chapter 3.
Evaluating the remaining applications is an interesting area of future work.


Combining  multiple  heuristics  online 

An algorithm portfolio [38] is a schedule for interleaving the execution of multiple
(randomized) algorithms and periodically restarting them with a fresh random seed. Previous
work has shown that combining multiple heuristics for NP-hard problems into a portfolio
can dramatically reduce average-case running time [33, 38, 78]. In particular, algorithms
based on chronological backtracking often exhibit heavy-tailed run length distributions,
and periodically restarting them with a fresh random seed can reduce the mean running
time by orders of magnitude [34]. Our algorithms can be used to learn an effective
algorithm portfolio online, in the course of solving a sequence of problem instances. Chapter
3 considers this application in detail and presents an experimental evaluation.
Database query optimization


In  database  query  processing,  one  must  extract  all  the  records  in  a  database  that  satisfy 
every  predicate  in  a  list  of  one  or  more  predicates  (the  conjunction  of  predicates  comprises 
the  query).  To  process  the  query,  each  record  is  evaluated  against  the  predicates  one  at  a 
time  until  the  record  either  fails  to  satisfy  some  predicate  (in  which  case  it  does  not  match 
the  query)  or  all  predicates  have  been  examined.  The  order  in  which  the  predicates  are 
examined  affects  the  time  required  to  process  the  query.  Munagala  et  al.  [64]  introduced 
and  studied  a  problem  called  Pipelined  Set  Cover,  which  entails  finding  an  evaluation 
order  for  the  predicates  that  minimizes  the  average  time  required  to  process  a  record.  As 
discussed in §2.1.4, Pipelined Set Cover is a special case of Min-Sum Submodular
Cover. In the online version of Pipelined Set Cover, records arrive one at a time
and  one  may  select  a  different  evaluation  order  for  each  record.  In  our  terms,  the  records 
are  jobs  and  predicates  are  activities. 
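
To illustrate how the evaluation order affects processing time, the sketch below computes the average time per record for a given predicate order and finds the best fixed order by brute force. The predicates, their evaluation times and the records are hypothetical, and exhaustive search over orders is used only because the example is tiny.

```python
import itertools


def processing_time(fails, order, eval_time):
    """Time spent on one record; `fails` is the set of predicates the record fails."""
    elapsed = 0.0
    for v in order:
        elapsed += eval_time[v]
        if v in fails:
            break                       # record rejected, stop evaluating
    return elapsed


def average_time(records, order, eval_time):
    return sum(processing_time(fails, order, eval_time)
               for fails in records) / len(records)


# Hypothetical data: three predicates and four records.
eval_time = {"p": 1.0, "q": 2.0, "r": 1.0}
records = [{"q"}, {"q"}, {"q"}, {"p"}]

best_order = min(itertools.permutations(eval_time),
                 key=lambda order: average_time(records, order, eval_time))
print(best_order, average_time(records, best_order, eval_time))
# ('q', 'p', 'r') 2.25
```

Evaluating the slower predicate "q" first pays off here because it rejects three of the four records.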


Sensor  placement 

Sensor  placement  is  the  task  of  assigning  locations  to  a  set  of  sensors  so  as  to  maximize 
the  value  of  the  information  obtained  (e.g.,  to  maximize  the  number  of  intrusions  that 
are  detected  by  the  sensors).  Many  sensor  placement  problems  can  be  optimally  solved  by 
maximizing  a  monotone  submodular  set  function  subject  to  a  knapsack  constraint  [55].  As 
discussed in §2.1.4, this problem is a special case of Budgeted Maximum Submodular
Coverage. Our online algorithms could be used to select sensor placements when
the  same  set  of  sensors  is  repeatedly  deployed  in  an  unknown  or  adversarial  environment. 


Viral  marketing 

Viral marketing infects a set of agents (e.g., individuals or groups) with an advertisement
which they may pass on to other potential customers. Under a standard model of social
network dynamics, the total number of potential customers that are influenced by the
advertisement is a submodular function of the set of agents that are initially infected [48].
Previous work [48] gave an algorithm for selecting a set of agents to initially infect so as to
maximize the influence of an advertisement, assuming the dynamics of the social network
are known. In theory, our online algorithms could be used to adapt a marketing campaign
to unknown or time-varying social network dynamics.
2.2 Related Work


As discussed in §2.1.4, the Min-Sum Submodular Cover problem introduced in
this chapter generalizes several previously-studied problems, including Min-Sum Set
Cover [26], Pipelined Set Cover [44, 64], the problem of constructing efficient
sequences of trials [22], and the problem of constructing restart schedules [35, 61, 79].

Several of these problems have been considered in the online setting. Munagala et
al. [64] gave an online algorithm for Pipelined Set Cover whose O(log |V|)-regret is
o(n), where n is the number of records (jobs). Babu et al. [9] and Kaplan et al. [44] gave
online algorithms for Pipelined Set Cover whose 4-regret is o(n), but these bounds
hold only in the special case where the jobs are drawn independently at random from a
fixed probability distribution. The online setting in this chapter, where the sequence of
jobs may be arbitrary, is more challenging from a technical point of view.

As already mentioned, Budgeted Maximum Submodular Coverage
generalizes the problem of maximizing a monotone submodular set function subject to a knapsack
constraint. Previous work gave offline greedy approximation algorithms for this problem
[56, 84], which generalized earlier algorithms for Budgeted Maximum Coverage
[49] and Max k-Coverage [65]. To our knowledge, none of these three problems have
previously been studied in an online setting.

It is worth pointing out that the online problems we consider here are quite different
from online set cover problems that require one to construct a single collection of sets that
cover each element in a sequence of elements that arrive online [1, 7]. Likewise, our work
is orthogonal to work on online facility location problems [62].

The main technical contribution of this chapter is to convert some specific greedy
approximation algorithms into online algorithms. Recently, Kakade et al. [41] gave a generic
procedure for converting an α-approximation algorithm for a linear problem into an
online algorithm whose α-regret is o(n), and this procedure could be applied to the problems
considered in this chapter. However, both the running time of their algorithm and the
resulting regret bounds depend on the dimension of the linear problem, and a straightforward


2.3  Offline  Algorithms 


In this section we consider the offline problems Budgeted Maximum Submodular
Coverage and Min-Sum Submodular Cover. In the offline setting, we are given
as input a job f : 𝒮 → [0, 1]. Our goal is to compute a schedule S that achieves one
of two objectives: for Budgeted Maximum Submodular Coverage, we wish to
maximize f(S) subject to the constraint ℓ(S) ≤ T (for some fixed T > 0), while for
Min-Sum Submodular Cover, we wish to minimize the cost c(f, S).

The  offline  algorithms  presented  in  this  section  will  serve  as  the  basis  for  the  online 
algorithms  we  develop  in  the  next  section. 

Note that we have defined the offline problem in terms of optimizing a single job.
However, given a set {f_1, f_2, ..., f_n}, we can optimize average schedule cost (or coverage)
by applying our offline algorithm to the job f = (1/n) Σ_{i=1}^{n} f_i (as already mentioned, any
convex combination of jobs is a job).


2.3.1  Computational  complexity 

Both of the offline problems considered in this chapter are NP-hard even to approximate.
As discussed in §2.1.4, Min-Sum Submodular Cover generalizes Min-Sum Set
Cover, and Budgeted Maximum Submodular Coverage generalizes Max
k-Coverage. In a classic paper, Feige proved that for any ε > 0, achieving an
approximation ratio of 1 − 1/e + ε for Max k-Coverage is NP-hard [24]. Recently, Feige, Lovász,
and Tetali [26] introduced Min-Sum Set Cover and proved that for any ε > 0, achieving
a 4 − ε approximation ratio for Min-Sum Set Cover is NP-hard. These observations
immediately yield the following theorems.

Theorem 1. For any ε > 0, achieving a 1 − 1/e + ε approximation ratio for Budgeted
Maximum Submodular Coverage is NP-hard.

Theorem 2. For any ε > 0, achieving a 4 − ε approximation ratio for Min-Sum
Submodular Cover is NP-hard.


2.3.2  Greedy  approximation  algorithm 

In this section we present a greedy approximation algorithm that can be used to achieve
a 4 approximation for Min-Sum Submodular Cover and a 1 − 1/e approximation for
Budgeted Maximum Submodular Coverage. By Theorems 1 and 2, achieving a
better approximation ratio for either problem is NP-hard.


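Consider the schedule defined by the following simple greedy rule. Let $G = \langle g_1, g_2, \ldots \rangle$
be the schedule defined inductively as follows: $G_1 = \langle \rangle$, $G_j = \langle g_1, g_2, \ldots, g_{j-1} \rangle$ for $j > 1$,
and

\[
g_j = \arg\max_{(v,\tau) \in V \times \mathbb{R}_{>0}} \left\{ \frac{f(G_j \oplus \langle (v,\tau) \rangle) - f(G_j)}{\tau} \right\} . \tag{2.4}
\]

That is, $G$ is constructed by greedily appending an action $(v, \tau)$ to the schedule so as to
maximize the resulting increase in $f$ per unit time.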

Once we reach a $j$ such that $f(G_j) = 1$, we may stop adding actions to the schedule. In
general, however, $G$ may contain an infinite number of actions. For example, if each action
$(v, \tau)$ represents running a Las Vegas algorithm $v$ for time $\tau$ and $f(S)$ is the probability
that any of the runs in $S$ returns a solution to some problem instance (see Example 1), it
is possible that $f(S) < 1$ for any finite schedule $S$. The best way of dealing with this
is application-dependent. In the case of Example 1, we might stop computing $G$ when
$f(G_j) \ge 1 - \delta$ for some small $\delta > 0$.

The time required to compute $G$ is also application-dependent. In the applications
of interest to us, evaluating the arg max in (2.4) will only require us to consider a finite
number of actions $(v, \tau)$. In some cases, the evaluation of the arg max in (2.4) can be sped
up using application-specific data structures. In Chapter 3, we discuss the time required to
compute $G$ for various applications of interest.
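The greedy rule (2.4), restricted to a finite candidate set and combined with the early-stopping idea described above, can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the function name `greedy_schedule`, the list-based schedule encoding, and the toy job (independent runs with a fixed per-unit success probability, in the spirit of the Las Vegas example) are all assumptions.

```python
from typing import Callable, List, Sequence, Tuple

Action = Tuple[str, int]   # (activity v, duration tau); illustrative encoding

def greedy_schedule(f: Callable[[Sequence[Action]], float],
                    actions: List[Action],
                    max_steps: int,
                    delta: float = 1e-9) -> List[Action]:
    """Build G by repeatedly appending the action with the largest gain per unit time."""
    G: List[Action] = []
    for _ in range(max_steps):
        base = f(G)
        if base >= 1.0 - delta:          # job (almost) complete: stop early
            break
        # Rule (2.4), with the arg max taken over the finite set `actions`.
        best = max(actions, key=lambda a: (f(G + [a]) - base) / a[1])
        if f(G + [best]) - base <= 0.0:  # no candidate makes progress
            break
        G.append(best)
    return G

# Toy job (an assumption): each unit of time spent on v succeeds independently
# with probability P[v]; f(S) is the probability that some run succeeds.
P = {"fast": 0.5, "slow": 0.2}
def f(S: Sequence[Action]) -> float:
    fail = 1.0
    for v, tau in S:
        fail *= (1.0 - P[v]) ** tau
    return 1.0 - fail

G = greedy_schedule(f, [("fast", 1), ("slow", 1), ("fast", 2)], max_steps=3)
```

Each step costs one evaluation of `f` per candidate action, which is where application-specific data structures could help.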

As mentioned in §2.1.2, our analysis of the greedy approximation algorithm will only
require that $f$ is monotone and that $f$ satisfies Condition 1. The following lemma shows
that if $f$ is a job, then $f$ also satisfies these conditions.

Lemma 1. If $f$ satisfies (2.1), then $f$ satisfies Condition 1. That is, for any schedules
$S_1, S \in \mathcal{S}$, we have

\[
\frac{f(S_1 \oplus S) - f(S_1)}{\ell(S)} \le \max_{(v,\tau) \in V \times \mathbb{R}_{>0}} \left\{ \frac{f(S_1 \oplus \langle (v,\tau) \rangle) - f(S_1)}{\tau} \right\} .
\]


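Proof. Let $r$ denote the right hand side of the inequality. Let $S = \langle a_1, a_2, \ldots, a_L \rangle$, where
$a_l = (v_l, \tau_l)$. Let

\[
\Delta_l = f(S_1 \oplus \langle a_1, a_2, \ldots, a_l \rangle) - f(S_1 \oplus \langle a_1, a_2, \ldots, a_{l-1} \rangle) .
\]

We have

\begin{align*}
f(S_1 \oplus S) &= f(S_1) + \sum_{l=1}^{L} \Delta_l && \text{(telescoping series)} \\
&\le f(S_1) + \sum_{l=1}^{L} \left( f(S_1 \oplus \langle a_l \rangle) - f(S_1) \right) && \text{(submodularity)} \\
&\le f(S_1) + \sum_{l=1}^{L} r \cdot \tau_l && \text{(definition of $r$)} \\
&= f(S_1) + r \cdot \ell(S) .
\end{align*}

Rearranging this inequality gives $\frac{f(S_1 \oplus S) - f(S_1)}{\ell(S)} \le r$, as claimed. □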

The key to the analysis of the greedy approximation algorithm is the following fact,
which is the only property of $G$ that we will use in our analysis.

Fact 1. For any schedule $S$, any positive integer $j$, and any $t > 0$, we have

\[
f(S_{\langle t \rangle}) \le f(G_j) + t \cdot s_j
\]

where $s_j$ is the $j$th value of the maximum in (2.4).

Fact 1 holds because $f(S_{\langle t \rangle}) \le f(G_j \oplus S_{\langle t \rangle})$ by monotonicity, while
$f(G_j \oplus S_{\langle t \rangle}) \le f(G_j) + t \cdot s_j$ by Condition 1 and the definition of $s_j$.


Maximizing  coverage 


We first analyze the performance of the greedy algorithm on the Budgeted Maximum
Submodular Coverage problem. The following theorem shows that, for certain values
of $T$, the greedy schedule achieves the optimal approximation ratio of $1 - \frac{1}{e}$ for this
problem. The proof of the theorem is similar to arguments in [56, 84].

Theorem 3. Let $L$ be a positive integer, and let $T = \sum_{j=1}^{L} \tau_j$, where $g_j = (v_j, \tau_j)$. Then
$f(G_{\langle T \rangle}) > \left(1 - \frac{1}{e}\right) \max_{S \in \mathcal{S}} \left\{ f(S_{\langle T \rangle}) \right\}$.


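Proof. Let $C^* = \max_{S \in \mathcal{S}} \left\{ f(S_{\langle T \rangle}) \right\}$, and for any positive integer $j$, let $\Delta_j = C^* - f(G_j)$.
By Fact 1, $C^* \le f(G_j) + T s_j$. Thus

\[
\Delta_j \le T s_j = T \, \frac{\Delta_j - \Delta_{j+1}}{\tau_j} .
\]

Rearranging this inequality gives $\Delta_{j+1} \le \Delta_j \left(1 - \frac{\tau_j}{T}\right)$. Unrolling this inequality, we get

\[
\Delta_{L+1} \le \Delta_1 \prod_{j=1}^{L} \left(1 - \frac{\tau_j}{T}\right) .
\]

Subject to the constraint $\sum_{j=1}^{L} \tau_j = T$, the product series is maximized when $\tau_j = \frac{T}{L}$ for
all $j$. Thus we have

\[
C^* - f(G_{L+1}) = \Delta_{L+1} \le \Delta_1 \left(1 - \frac{1}{L}\right)^{L} < \Delta_1 \frac{1}{e} \le C^* \frac{1}{e} .
\]

Thus $f(G_{L+1}) > \left(1 - \frac{1}{e}\right) C^*$, as claimed. □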
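The step "the product is maximized when all durations are equal" can be sanity-checked numerically. The snippet below is only an illustration on one hand-picked instance (the durations 1, 2, 3, 4 are an arbitrary assumption); it compares the product for unequal durations with the equal-duration value $(1 - 1/L)^L$ and with $1/e$.

```python
import math

def product_bound(taus):
    """Return prod_j (1 - tau_j / T) with T = sum(taus)."""
    T = sum(taus)
    return math.prod(1.0 - t / T for t in taus)

uneven = product_bound([1, 2, 3, 4])   # L = 4, unequal durations, T = 10
even = product_bound([2.5] * 4)        # same T, equal durations: (1 - 1/4)**4
```

Here `uneven` is smaller than `even`, and both stay below $1/e$, as the proof requires.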

Theorem 3 shows that $G$ gives a $1 - \frac{1}{e}$ approximation to the problem of maximizing
coverage at time $T$, provided that $T$ equals the sum of the durations of the actions in $G_j$
for some positive integer $j$. Under the assumption that $f$ is a job (as opposed to the weaker
assumption that $f$ satisfies Condition 1), the greedy algorithm can be combined with the
partial enumeration approach of Khuller et al. [49] to achieve a $1 - \frac{1}{e}$ approximation ratio
for any fixed $T$. The idea of this approach is to guess a sequence $Y = \langle a_1, a_2, a_3 \rangle$ of three
actions, and then run the greedy algorithm on the job $f'(S) = f(Y \oplus S) - f(Y)$ with
budget $T - T_0$, where $T_0$ is the total time consumed by the actions in $Y$. The arguments
of [49, 84] show that, for some choice of $Y$, this yields a $\left(1 - \frac{1}{e}\right)$-approximation. In order
for this approach to be feasible, actions must have discrete durations, so that the number
of possible choices of $Y$ is finite.


Minimizing  cost 

We next analyze the performance of the greedy algorithm on the Min-Sum Submodular
Cover problem. The following theorem uses the proof technique of [26] to show
that the greedy schedule $G$ has cost at most 4 times that of the optimal schedule, generalizing
results of [26, 44, 64, 78, 79]. As already mentioned, achieving a better approximation
ratio is NP-hard.

Theorem 4. $c(f, G) \le 4 \int_{t=0}^{\infty} 1 - \max_{S \in \mathcal{S}} \left\{ f(S_{\langle t \rangle}) \right\} dt \le 4 \min_{S \in \mathcal{S}} c(f, S)$.

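Proof. Let $R_j = 1 - f(G_j)$; let $x_j = \frac{R_j}{2 s_j}$; let $y_j = \frac{R_j}{2}$; and let $h(x) = 1 - \max_{S} \left\{ f(S_{\langle x \rangle}) \right\}$.

By Fact 1,

\[
\max_{S} \left\{ f(S_{\langle x_j \rangle}) \right\} \le f(G_j) + x_j s_j = f(G_j) + \frac{R_j}{2} .
\]

Thus $h(x_j) \ge R_j - \frac{R_j}{2} = y_j$. The monotonicity of $f$ implies that $h(x)$ is non-increasing
and also that the sequence $\langle y_1, y_2, \ldots \rangle$ is non-increasing. As illustrated in Figure 2.1, these
facts imply that $\int_{x=0}^{\infty} h(x) \, dx \ge \sum_{j \ge 1} x_j (y_j - y_{j+1})$. Thus we have

\begin{align*}
\int_{t=0}^{\infty} 1 - \max_{S \in \mathcal{S}} \left\{ f(S_{\langle t \rangle}) \right\} dt &= \int_{x=0}^{\infty} h(x) \, dx \\
&\ge \sum_{j \ge 1} x_j (y_j - y_{j+1}) && \text{(Figure 2.1)} \\
&= \frac{1}{4} \sum_{j \ge 1} R_j \, \frac{R_j - R_{j+1}}{s_j} \\
&= \frac{1}{4} \sum_{j \ge 1} R_j \tau_j \\
&\ge \frac{1}{4} c(f, G) && \text{(monotonicity of $f$)}
\end{align*}

which proves the theorem. □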


Figure 2.1: An illustration of the inequality $\int_{x=0}^{\infty} h(x) \, dx \ge \sum_{j \ge 1} x_j (y_j - y_{j+1})$. The left
hand side is the area under the curve, whereas the right hand side is the sum of the areas
of the shaded rectangles.


A  refined  greedy  approximation  algorithm 

A drawback of $G$ is that it greedily chooses an action $g_j = (v, \tau)$ that maximizes the
marginal increase in $f$ divided by $\tau$, whereas the contribution of $(v, \tau)$ to the cost of $G$ is
not $\tau$ but rather

\[
\int_{t=0}^{\tau} 1 - f(G_j \oplus \langle (v,t) \rangle) \, dt .
\]

This can lead $G$ to perform suboptimally even in seemingly easy cases. To see this, let
$V = \{v_1, v_2\}$, let $S^1_t = \langle (v_1, t) \rangle$, and let $S^2_t = \langle (v_2, t) \rangle$. Let $f$ be a job defined by

\[
f(S^1_t) = \begin{cases} 1 & \text{if } t \ge 1 \\ 0 & \text{otherwise} \end{cases}
\]

whereas

\[
f(S^2_t) = \min \{1, t\} .
\]

For any schedule $S = \langle a_1, a_2, \ldots, a_L \rangle$ containing more than one action, let
$f(S) = \max_{l=1}^{L} f(\langle a_l \rangle)$. It is straightforward to check that $f$ satisfies the monotonicity and
submodularity conditions required of a job.

Here the optimal schedule is $S^* = \langle (v_2, 1) \rangle$, with cost $c(f, S^*) = \int_{t=0}^{1} 1 - t \, dt = \frac{1}{2}$.
However, if ties in the evaluation of the arg max in (2.4) are broken appropriately, the
greedy algorithm will choose the schedule $G = \langle (v_1, 1) \rangle$, with cost $c(f, G) = 1$.
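The two costs in this example can be reproduced numerically. The sketch below is illustrative only: it assumes that the truncation $S_{\langle t \rangle}$ keeps actions until time $t$ is used up and shortens the last one (consistent with the integral displayed above), integrates only up to $\ell(S)$ (enough here because both schedules reach $f = 1$ by then), and uses hypothetical helper names `truncate` and `cost`.

```python
from typing import List, Sequence, Tuple

Action = Tuple[str, float]

def truncate(S: Sequence[Action], t: float) -> List[Action]:
    """S_<t>: keep actions until time t is used up, shortening the last one if needed."""
    out, used = [], 0.0
    for v, tau in S:
        if used >= t:
            break
        out.append((v, min(tau, t - used)))
        used += tau
    return out

def f(S: Sequence[Action]) -> float:
    """The two-activity job from the text: the best single-action value in S."""
    def single(v: str, t: float) -> float:
        return (1.0 if t >= 1 else 0.0) if v == "v1" else min(1.0, t)
    return max((single(v, t) for v, t in S), default=0.0)

def cost(S: Sequence[Action], steps_per_unit: int = 1000) -> float:
    """Midpoint-rule estimate of the integral of 1 - f(S_<t>) over [0, l(S)]."""
    total = sum(tau for _, tau in S)
    n = int(round(total * steps_per_unit))
    h = total / n
    return sum((1.0 - f(truncate(S, (k + 0.5) * h))) * h for k in range(n))

cost_opt = cost([("v2", 1.0)])      # the schedule S*
cost_greedy = cost([("v1", 1.0)])   # the schedule G under unfavourable tie-breaking
```

The estimates agree with the closed-form values $\frac{1}{2}$ and $1$ up to floating-point error.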

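To improve performance in cases such as this, it is natural to consider the schedule
$G' = \langle g'_1, g'_2, \ldots \rangle$ defined inductively as follows: $G'_j = \langle g'_1, g'_2, \ldots, g'_{j-1} \rangle$ and

\[
g'_j = \arg\max_{(v,\tau) \in V \times \mathbb{R}_{>0}} \left\{ \frac{f(G'_j \oplus \langle (v,\tau) \rangle) - f(G'_j)}{\int_{t=0}^{\tau} 1 - f(G'_j \oplus \langle (v,t) \rangle) \, dt} \right\} . \tag{2.5}
\]

Theorem 5 shows that $G'$ achieves the same approximation ratio as $G$. The proof is
similar to the proof of Theorem 4, and is given in Appendix A.

Theorem 5. $c(f, G') \le 4 \int_{t=0}^{\infty} 1 - \max_{S \in \mathcal{S}} \left\{ f(S_{\langle t \rangle}) \right\} dt \le 4 \min_{S \in \mathcal{S}} \left\{ c(f, S) \right\}$.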

Furthermore, we prove in Chapter 3 (see Theorem 18) that, in contrast to $G$, $G'$ is
optimal in the important special case when $V = \{v\}$, action $(v, \tau)$ represents running a Las
Vegas algorithm $v$ (with a fresh random seed) for time $\tau$, and $f(S)$ equals the probability
that at least one of the runs in $S$ returns a solution to some particular problem instance (as
described in Example 1).
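As a worked check (not part of the original analysis), rule (2.5) can be evaluated at the first step of the two-activity example above, where the schedule built so far is empty and, by assumption, $f$ of the empty schedule is 0:

```latex
% Activity v_1: no progress before t = 1, so the denominator equals \min\{\tau, 1\}.
\frac{f(\langle (v_1,\tau) \rangle)}{\int_{t=0}^{\tau} 1 - f(\langle (v_1,t) \rangle)\,dt}
  = \begin{cases} 0 & \text{if } \tau < 1 \\ 1 & \text{if } \tau \ge 1 \end{cases}
\qquad
% Activity v_2 with 0 < \tau \le 1: gain \tau, denominator \tau - \tau^2/2.
\frac{f(\langle (v_2,\tau) \rangle)}{\int_{t=0}^{\tau} 1 - f(\langle (v_2,t) \rangle)\,dt}
  = \frac{\tau}{\tau - \tau^2/2} = \frac{1}{1 - \tau/2} \le 2 .
```

The ratio for $v_2$ reaches 2 at $\tau = 1$, whereas the ratio for $v_1$ never exceeds 1, so under (2.5) the first action uses $v_2$ (for instance $(v_2, 1)$), whose cost $\frac{1}{2}$ matches that of the optimal schedule in the example.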


Handling  non-uniform  additive  error 

We now consider the case in which the $j$th decision made by the greedy algorithm is
performed with some additive error $\epsilon_j$. This case is of interest for two reasons. First, in
some cases it may not be practical to evaluate the arg max in (2.4) exactly. Second, and
more importantly, we will end up viewing our online algorithm as a version of the offline
greedy algorithm in which each decision is made with some additive error. In this section
we analyze the original greedy schedule $G$ as opposed to the refined schedule $G'$ described
in the previous section, because it is the original schedule $G$ that will form the basis of our
online algorithm (as we discuss further in §2.5, devising an online algorithm based on $G'$
is an interesting open problem).

We denote by $\bar{G} = \langle \bar{g}_1, \bar{g}_2, \ldots \rangle$ a variant of the schedule $G$ in which the $j$th arg max
in (2.4) is evaluated with additive error $\epsilon_j$. More formally, $\bar{G}$ is a schedule that, for any
$j \ge 1$, satisfies

\[
\frac{f(\bar{G}_j \oplus \bar{g}_j) - f(\bar{G}_j)}{\bar{\tau}_j} \ge \max_{(v,\tau) \in V \times \mathbb{R}_{>0}} \left\{ \frac{f(\bar{G}_j \oplus \langle (v,\tau) \rangle) - f(\bar{G}_j)}{\tau} \right\} - \epsilon_j
\]

where $\bar{G}_0 = \langle \rangle$, $\bar{G}_j = \langle \bar{g}_1, \bar{g}_2, \ldots, \bar{g}_{j-1} \rangle$ for $j > 1$, and $\bar{g}_j = (\bar{v}_j, \bar{\tau}_j)$.

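The following two theorems summarize the performance of $\bar{G}$. The proofs are given
in Appendix A, and are along the same lines as those of Theorems 3 and 4.

Theorem 6. Let $L$ be a positive integer, and let $T = \sum_{j=1}^{L} \bar{\tau}_j$, where $\bar{g}_j = (\bar{v}_j, \bar{\tau}_j)$. Then

\[
f(\bar{G}_{\langle T \rangle}) > \left(1 - \frac{1}{e}\right) \max_{S \in \mathcal{S}} \left\{ f(S_{\langle T \rangle}) \right\} - \sum_{j=1}^{L} \epsilon_j \bar{\tau}_j .
\]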


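Theorem 7. Let $L$ be a positive integer, and let $T = \sum_{j=1}^{L} \bar{\tau}_j$, where $\bar{g}_j = (\bar{v}_j, \bar{\tau}_j)$. For
any schedule $S$, define $c^T(f, S) = \int_{t=0}^{T} 1 - f(S_{\langle t \rangle}) \, dt$. Then

\[
c^T(f, \bar{G}) \le 4 \int_{t=0}^{\infty} 1 - \max_{S \in \mathcal{S}} \left\{ f(S_{\langle t \rangle}) \right\} dt + T \sum_{j=1}^{L} E_j
\]

where $E_j = \epsilon_j \bar{\tau}_j$.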


2.4  Online  Algorithms 

In this section we consider the online versions of Budgeted Maximum Submodular
Coverage and Min-Sum Submodular Cover. In the online setting we are fed,
one at a time, a sequence $\langle f_1, f_2, \ldots, f_n \rangle$ of jobs. Prior to receiving job $f_i$, we must
specify a schedule $S_i$. We then receive complete access to the function $f_i$. We measure
the performance of our online algorithm using two different notions of regret. For the cost
objective, our goal is to minimize the 4-regret

\[
R_{\mathrm{cost}} = \sum_{i=1}^{n} c^T(S_i, f_i) - 4 \cdot \min_{S \in \mathcal{S}} \left\{ \sum_{i=1}^{n} c(S, f_i) \right\}
\]

for some fixed $T > 0$. Here, for any schedule $S$ and job $f$, we define
$c^T(S, f) = \int_{t=0}^{T} 1 - f(S_{\langle t \rangle}) \, dt$ to be the value of $c(S, f)$ when the integral is truncated at time $T$.
Some form of truncation is necessary because $c(S_i, f_i)$ could be infinite, and without
bounding it we could not prove any finite bound on regret (our regret bounds will be stated
as a function of $T$).

For the objective of maximizing the coverage at time $T$, our goal is to minimize the
$\left(1 - \frac{1}{e}\right)$-regret

\[
R_{\mathrm{coverage}} = \left(1 - \frac{1}{e}\right) \max_{S \in \mathcal{S}} \left\{ \sum_{i=1}^{n} f_i(S_{\langle T \rangle}) \right\} - \sum_{i=1}^{n} f_i(S_i)
\]

where we require that $E[\ell(S_i)] = T$, in expectation over the online algorithm's random
bits. In other words, we allow the online algorithm to treat $T$ as a budget in expectation,
rather than a hard budget.

Our goal is to bound the expected values of $R_{\mathrm{cost}}$ (resp. $R_{\mathrm{coverage}}$) on the worst-case
sequence of $n$ jobs. We consider the so-called oblivious adversary model, in which the
sequence of jobs is fixed in advance and does not change in response to the decisions
made by our online algorithm, although we believe our results can be readily extended
to the case of adaptive adversaries. Note that the constant of 4 in the definition of $R_{\mathrm{cost}}$
and the constant of $1 - \frac{1}{e}$ in the definition of $R_{\mathrm{coverage}}$ stem from the NP-hardness of the
corresponding offline problems, as discussed in §2.3.1.

For the purposes of the results in this section, we confine our attention to schedules that
consist of actions that come from some finite set $A$, and we assume that the actions in $A$
have integer durations (i.e., $A \subseteq V \times \mathbb{Z}_{>0}$). Note that this is not a serious limitation, because
real-valued action durations can always be discretized at whatever level of granularity is
desired.

As mentioned in §2.1.2, our results in the online setting will hold for any sequence
$\langle f_1, f_2, \ldots, f_n \rangle$ of functions that satisfies Condition 2. The following lemma shows that
any sequence of jobs satisfies this condition. The proof follows along the same lines as the
proof of Lemma 1, and is given in Appendix A.

Lemma 2. Any sequence $\langle f_1, f_2, \ldots, f_n \rangle$ of jobs satisfies Condition 2. That is, for any
sequence $S_1, S_2, \ldots, S_n$ of schedules and any schedule $S$,

\[
\frac{\sum_{i=1}^{n} f_i(S_i \oplus S) - f_i(S_i)}{\ell(S)} \le \max_{(v,\tau) \in V \times \mathbb{R}_{>0}} \left\{ \frac{\sum_{i=1}^{n} f_i(S_i \oplus \langle (v,\tau) \rangle) - f_i(S_i)}{\tau} \right\} .
\]

2.4.1  Background:  the  experts  problem 

In the experts problem, one has access to a set of $k$ experts, each of whom gives out a piece
of advice every day. On each day $i$, one must select an expert $e_i$ whose advice to follow.
Following the advice of expert $j$ on day $i$ yields a reward $x^i_j$. At the end of day $i$, the value
of the reward $x^i_j$ for each expert $j$ is made public, and can be used as the basis for making
choices on subsequent days. One's regret at the end of $n$ days is equal to

\[
\max_{1 \le j \le k} \left\{ \sum_{i=1}^{n} x^i_j \right\} - \sum_{i=1}^{n} x^i_{e_i} .
\]

Note that the historical performance of an expert does not imply any guarantees about its
future performance. Remarkably, randomized decision-making algorithms nevertheless
exist whose regret grows sub-linearly in the number of days. By picking experts using
such an algorithm, one is guaranteed to obtain (asymptotically as $n \to \infty$) an average
reward that is as large as the maximum reward that could have been obtained by following
the advice of any fixed expert for all $n$ days.

In particular, for any fixed value of $G_{\max}$, where $G_{\max} = \max_{1 \le j \le k} \left\{ \sum_{i=1}^{n} x^i_j \right\}$, the
randomized weighted majority algorithm (WMR) [60] can be used to achieve worst-case
regret $O\left(\sqrt{G_{\max} \ln k}\right)$. If $G_{\max}$ is not known in advance, a putative value can be guessed
and doubled to achieve the same guarantee up to a constant factor.
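For concreteness, the sketch below shows one common full-information experts algorithm of the exponential-weights (Hedge) type. It is an illustrative stand-in and is not claimed to reproduce the exact update rule or parameter tuning of WMR in [60]; the class name, the learning rate `eta`, and the toy payoff sequence are assumptions.

```python
import math
import random
from typing import List, Optional

class ExponentialWeights:
    """Hedge-style experts algorithm for payoffs in [0, 1] (illustrative stand-in)."""

    def __init__(self, k: int, eta: float, rng: Optional[random.Random] = None):
        self.log_w = [0.0] * k          # log-weights, kept in log space for stability
        self.eta = eta
        self.rng = rng or random.Random(0)

    def probabilities(self) -> List[float]:
        m = max(self.log_w)
        w = [math.exp(lw - m) for lw in self.log_w]
        z = sum(w)
        return [x / z for x in w]

    def select(self) -> int:
        """Sample an expert with probability proportional to its weight."""
        return self.rng.choices(range(len(self.log_w)), weights=self.probabilities())[0]

    def update(self, payoffs: List[float]) -> None:
        """Full-information feedback: every expert's payoff for the day is revealed."""
        for j, x in enumerate(payoffs):
            self.log_w[j] += self.eta * x

# Toy run: expert 2 earns payoff 1 every day, the others earn 0.
alg = ExponentialWeights(k=3, eta=0.5)
for _ in range(50):
    alg.select()
    alg.update([0.0, 0.0, 1.0])
p = alg.probabilities()
```

After a few dozen days nearly all probability mass sits on the consistently best expert, which is the mechanism behind sub-linear regret.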

2.4.2  Unit-cost  actions 

In the special case in which each action takes unit time (i.e., $A \subseteq V \times \{1\}$), our online
algorithm $\mathrm{OG}_{\mathrm{unit}}$ is very simple. $\mathrm{OG}_{\mathrm{unit}}$ runs $T$ experts algorithms$^2$: $\mathcal{E}_1, \mathcal{E}_2, \ldots, \mathcal{E}_T$,
where $T$ is the number of time steps for which our schedule is defined. The set of experts is
$A$. Just before job $f_i$ arrives, each experts algorithm $\mathcal{E}_t$ selects an action $a^i_t$. The schedule
used by $\mathrm{OG}_{\mathrm{unit}}$ on job $f_i$ is $S_i = \langle a^i_1, a^i_2, \ldots, a^i_T \rangle$. The payoff that $\mathcal{E}_t$ associates with
action $a$ is $f_i(S_{i\langle t-1 \rangle} \oplus a) - f_i(S_{i\langle t-1 \rangle})$.

$^2$ In general, $\mathcal{E}_1, \mathcal{E}_2, \ldots, \mathcal{E}_T$ will be $T$ distinct copies of a single experts algorithm, such as randomized
weighted majority.

Algorithm $\mathrm{OG}_{\mathrm{unit}}$

Input: integer $T$, experts algorithms $\mathcal{E}_1, \mathcal{E}_2, \ldots, \mathcal{E}_T$.
For $i$ from 1 to $n$:

1. For each $t$, $1 \le t \le T$, use $\mathcal{E}_t$ to select an action $a^i_t$.

2. Select the schedule $S_i = \langle a^i_1, a^i_2, \ldots, a^i_T \rangle$.

3. Receive the job $f_i$.

4. For each $t$, $1 \le t \le T$, and each action $a \in A$, feed back
$f_i(S_{i\langle t-1 \rangle} \oplus a) - f_i(S_{i\langle t-1 \rangle})$ as the payoff $\mathcal{E}_t$ would have received
by choosing action $a$.
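The four steps of the algorithm can be sketched in code as follows. This is a minimal illustration under stated assumptions: the experts subroutine is a simple exponential-weights learner standing in for WMR, unit-time actions are represented by their activity names only, and the function name `og_unit`, the learning rate, and the toy coverage job are hypothetical.

```python
import math
import random
from typing import Callable, List, Sequence

class Hedge:
    """Minimal exponential-weights experts algorithm (illustrative stand-in for WMR)."""
    def __init__(self, k: int, eta: float, rng: random.Random):
        self.log_w, self.eta, self.rng = [0.0] * k, eta, rng
    def select(self) -> int:
        m = max(self.log_w)
        return self.rng.choices(range(len(self.log_w)),
                                weights=[math.exp(w - m) for w in self.log_w])[0]
    def update(self, payoffs: Sequence[float]) -> None:
        self.log_w = [w + self.eta * x for w, x in zip(self.log_w, payoffs)]

def og_unit(jobs: List[Callable[[Sequence[str]], float]], actions: List[str],
            T: int, eta: float = 0.5, seed: int = 0) -> List[List[str]]:
    """Run OG_unit over `jobs`; return the schedule chosen before each job arrived."""
    rng = random.Random(seed)
    experts = [Hedge(len(actions), eta, rng) for _ in range(T)]
    chosen = []
    for f in jobs:
        # Steps 1-2: each experts algorithm picks one unit-time action for its slot.
        S = [actions[e.select()] for e in experts]
        chosen.append(S)
        # Steps 3-4: the job is revealed; the expert for slot t sees, for every
        # action a, the marginal gain of appending a to the first t-1 actions.
        for t, e in enumerate(experts):
            prefix = S[:t]
            base = f(prefix)
            e.update([f(prefix + [a]) - base for a in actions])
    return chosen

# Toy job (an assumption): fraction of the target set {"a", "b"} that is covered.
cover = lambda S: len(set(S) & {"a", "b"}) / 2.0
schedules = og_unit([cover] * 200, actions=["a", "b", "c"], T=2)
```

In this toy run the useless action "c" quickly loses weight in both slots, so late schedules are built from "a" and "b" only.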


Let $r_t$ be the regret experienced by experts algorithm $\mathcal{E}_t$ when running $\mathrm{OG}_{\mathrm{unit}}$, and let
$R = \sum_{t=1}^{T} r_t$. The key to the analysis of $\mathrm{OG}_{\mathrm{unit}}$ is the following lemma, which relates the
regret experienced by the experts algorithms to the regret on the original online problem.

Lemma 3. $R_{\mathrm{coverage}} \le R$ and $R_{\mathrm{cost}} \le TR$.

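Proof. We will view $\mathrm{OG}_{\mathrm{unit}}$ as producing an approximate version of the offline greedy
schedule for the function $f = \frac{1}{n} \sum_{i=1}^{n} f_i$. First, view the sequence of actions selected by $\mathcal{E}_t$
as a single "meta-action" $\tilde{a}_t$, and extend the domain of each $f_i$ to include the meta-actions
by defining $f_i(S \oplus \tilde{a}_t) = f_i(S \oplus a^i_t)$ for all $S \in \mathcal{S}$. Thus, the online algorithm produces a
single schedule $S_i = \tilde{S} = \langle \tilde{a}_1, \tilde{a}_2, \ldots, \tilde{a}_T \rangle$ for all $i$. By construction,

\[
\frac{r_t}{n} = \max_{a \in A} \left\{ f(\tilde{S}_{\langle t-1 \rangle} \oplus a) - f(\tilde{S}_{\langle t-1 \rangle}) \right\} - \left( f(\tilde{S}_{\langle t-1 \rangle} \oplus \tilde{a}_t) - f(\tilde{S}_{\langle t-1 \rangle}) \right) .
\]

Thus $\mathrm{OG}_{\mathrm{unit}}$ behaves exactly like the greedy schedule $\bar{G}$ for the function $f$, where the $t$th
decision is made with additive error $\frac{r_t}{n}$.

Furthermore, the fact that the sequence $\langle f_1, f_2, \ldots, f_n \rangle$ satisfies Condition 2 implies
that for any integer $t$ ($1 \le t \le T$) and any schedule $S$, we have

\[
\frac{f(\tilde{S}_{\langle t-1 \rangle} \oplus S) - f(\tilde{S}_{\langle t-1 \rangle})}{\ell(S)} \le \max_{(v,\tau) \in V \times \mathbb{R}_{>0}} \left\{ \frac{f(\tilde{S}_{\langle t-1 \rangle} \oplus \langle (v,\tau) \rangle) - f(\tilde{S}_{\langle t-1 \rangle})}{\tau} \right\} .
\]

Thus the function $f$ satisfies Condition 1, so the analysis of the greedy approximation
algorithm in §2.3.2 applies to the schedule $\tilde{S}$. In particular, Theorem 6 implies that
$R_{\mathrm{coverage}} \le \sum_{t=1}^{T} r_t = R$. Similarly, Theorem 7 implies that $R_{\mathrm{cost}} \le TR$. □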




To complete the analysis, it remains to bound $E[R]$. First, note that the payoffs to each
experts algorithm $\mathcal{E}_t$ depend on the choices made by experts algorithms $\mathcal{E}_1, \mathcal{E}_2, \ldots, \mathcal{E}_{t-1}$,
but not on the choices made by $\mathcal{E}_t$ itself. Thus, from the point of view of $\mathcal{E}_t$, the payoffs
are generated by a non-adaptive adversary. Suppose that randomized weighted majority
(WMR) is used as the subroutine experts algorithm. Because each payoff is at most 1 and
there are $n$ rounds, $E[r_t] = O\left(\sqrt{G_{\max} \ln |A|}\right) = O\left(\sqrt{n \ln |A|}\right)$, so a trivial bound is
$E[R] = O\left(T \sqrt{n \ln |A|}\right)$. In fact, we can show that the worst case is when $G_{\max} = \Theta\left(\frac{n}{T}\right)$
for all $T$ experts algorithms, leading to the following improved bound. The proof is given
in Appendix A.


Lemma 4. Algorithm $\mathrm{OG}_{\mathrm{unit}}$, run with WMR as the subroutine experts algorithm, has
$E[R] = O\left(\sqrt{T n \ln |A|}\right)$ in the worst case.
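To get a feel for the size of the improvement (ignoring the constants hidden by the $O$-notation, so this is only indicative), one can compare the trivial bound with the bound of Lemma 4:

```latex
\frac{T \sqrt{n \ln |A|}}{\sqrt{T n \ln |A|}} = \sqrt{T},
\qquad \text{e.g. } T = 100 \;\Longrightarrow\; \text{a factor of } 10 .
```

The gain therefore grows with the length of the schedule and is independent of $n$ and $|A|$.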


Combining  Lemmas  3  and  4  yields  the  following  theorem. 


Theorem 8. Algorithm $\mathrm{OG}_{\mathrm{unit}}$, run with WMR as the subroutine experts algorithm, has
$E[R_{\mathrm{coverage}}] = O\left(\sqrt{T n \ln |A|}\right)$ and $E[R_{\mathrm{cost}}] = O\left(T \sqrt{T n \ln |A|}\right)$ in the worst case.


2.4.3  From  unit-cost  actions  to  arbitrary  actions 

In this section we generalize the online greedy algorithm presented in the previous section
to accommodate actions with arbitrary durations. Like $\mathrm{OG}_{\mathrm{unit}}$, our generalized algorithm
OG makes use of a series of experts algorithms $\mathcal{E}_1, \mathcal{E}_2, \ldots, \mathcal{E}_L$ (for $L$ to be determined).
On each round $i$, OG constructs a schedule $S_i$ as follows: for $t = 1, 2, \ldots, L$, it uses $\mathcal{E}_t$ to
choose an action $a^i_t = (v, \tau) \in A$, and appends this action to $S_i$ with probability $\frac{1}{\tau}$. The
payoff that $\mathcal{E}_t$ associates with action $a$ equals $\frac{1}{\tau}$ times the increase in $f$ that would have
resulted from appending $a$ to the schedule-under-construction.

Algorithm OG

Input: integer $L$, experts algorithms $\mathcal{E}_1, \mathcal{E}_2, \ldots, \mathcal{E}_L$.
For $i$ from 1 to $n$:

1. Let $S_{i,0} = \langle \rangle$ be the empty schedule.

2. For each $t$, $1 \le t \le L$,

(a) Use $\mathcal{E}_t$ to choose an action $a^i_t = (v, \tau) \in A$.

(b) With probability $\frac{1}{\tau}$, set $S_{i,t} = S_{i,t-1} \oplus \langle a^i_t \rangle$; else set $S_{i,t} = S_{i,t-1}$.

3. Select the schedule $S_i = S_{i,L}$.

4. Receive the job $f_i$.

5. For each $t$, $1 \le t \le L$, and each action $a \in A$, feed back

\[
x^i_{t,a} = \frac{1}{\tau} \left( f_i(S_{i,t-1} \oplus a) - f_i(S_{i,t-1}) \right)
\]

as the payoff $\mathcal{E}_t$ would have received by choosing action $a$.
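The steps of Algorithm OG can be sketched as follows. As before, this is a hedged illustration rather than a reference implementation: the experts subroutine is a simple exponential-weights learner standing in for WMR, the duration used in each payoff is taken to be that of the action being scored, and the function name `og`, the learning rate, and the toy job are assumptions.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Action = Tuple[str, int]   # (activity v, integer duration tau)

class Hedge:
    """Minimal exponential-weights experts algorithm (illustrative stand-in for WMR)."""
    def __init__(self, k: int, eta: float, rng: random.Random):
        self.log_w, self.eta, self.rng = [0.0] * k, eta, rng
    def select(self) -> int:
        m = max(self.log_w)
        return self.rng.choices(range(len(self.log_w)),
                                weights=[math.exp(w - m) for w in self.log_w])[0]
    def update(self, payoffs: Sequence[float]) -> None:
        self.log_w = [w + self.eta * x for w, x in zip(self.log_w, payoffs)]

def og(jobs: List[Callable[[Sequence[Action]], float]], actions: List[Action],
       L: int, eta: float = 0.5, seed: int = 0) -> List[List[Action]]:
    """Run OG over `jobs`; return the schedule S_i selected before each job arrived."""
    rng = random.Random(seed)
    experts = [Hedge(len(actions), eta, rng) for _ in range(L)]
    chosen = []
    for f in jobs:
        partial: List[List[Action]] = [[]]       # partial[t] plays the role of S_{i,t}
        for e in experts:
            a = actions[e.select()]
            keep = rng.random() < 1.0 / a[1]     # step 2(b): append with probability 1/tau
            partial.append(partial[-1] + [a] if keep else partial[-1])
        chosen.append(partial[-1])               # step 3: S_i = S_{i,L}
        for t, e in enumerate(experts):          # step 5: full-information feedback
            prev, base = partial[t], f(partial[t])
            e.update([(f(prev + [a]) - base) / a[1] for a in actions])
    return chosen

# Toy job (an assumption): each unit of time spent on v succeeds independently
# with probability P[v]; f(S) is the probability that some run succeeds.
P = {"fast": 0.5, "slow": 0.2}
def f(S: Sequence[Action]) -> float:
    fail = 1.0
    for v, tau in S:
        fail *= (1.0 - P[v]) ** tau
    return 1.0 - fail

ACTIONS: List[Action] = [("fast", 1), ("slow", 2)]
schedules = og([f] * 100, ACTIONS, L=4)
```

Because an action of duration $\tau$ is kept only with probability $\frac{1}{\tau}$, each of the `L` slots consumes one unit of time in expectation, which is what the analysis below relies on.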


Our analysis of OG follows along the same lines as the analysis of $\mathrm{OG}_{\mathrm{unit}}$ in the
previous section. As in the previous section, we will view each experts algorithm $\mathcal{E}_t$ as
selecting a single "meta-action" $\tilde{a}_t$. We extend the domain of each $f_i$ to include the
meta-actions by defining

\[
f_i(S \oplus \tilde{a}_t) = \begin{cases} f_i(S \oplus a^i_t) & \text{if } a^i_t \text{ was appended to } S_i \\ f_i(S) & \text{otherwise.} \end{cases}
\]

Thus, the online algorithm produces a single schedule $S_i = \tilde{S} = \langle \tilde{a}_1, \tilde{a}_2, \ldots, \tilde{a}_L \rangle$ for all $i$.


For the purposes of analysis, we will imagine that each meta-action $\tilde{a}_t$ always takes
unit time (whereas in fact, $\tilde{a}_t$ takes unit time per job in expectation). We show later that
this assumption does not invalidate any of our arguments.

Let $f = \frac{1}{n} \sum_{i=1}^{n} f_i$, and let $\tilde{S}_t = \langle \tilde{a}_1, \tilde{a}_2, \ldots, \tilde{a}_t \rangle$. As in the previous section, the fact
that the sequence $\langle f_1, f_2, \ldots, f_n \rangle$ satisfies Condition 2 implies that $f$ satisfies Condition 1
(even if the schedule $S_1$ in the statement of Condition 1 contains meta-actions). Thus $\tilde{S}$
can be viewed as a version of the greedy schedule in which the $t$th decision is made with
additive error (by definition) equal to

\[
\epsilon_t = \max_{a = (v,\tau) \in A} \left\{ \frac{1}{\tau} \left( f(\tilde{S}_{t-1} \oplus a) - f(\tilde{S}_{t-1}) \right) \right\} - \left( f(\tilde{S}_{t-1} \oplus \tilde{a}_t) - f(\tilde{S}_{t-1}) \right)
\]

(where we have used the assumption that $\tilde{a}_t$ takes unit time).

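As in the previous section, let $r_t$ be the regret experienced by $\mathcal{E}_t$. In general, $\frac{r_t}{n} \ne \epsilon_t$.
However, we claim that $E[\epsilon_t] = E\left[\frac{r_t}{n}\right]$. To see this, fix some integer $t$ ($1 \le t \le L$), let
$A_t = \langle a^1_t, a^2_t, \ldots, a^n_t \rangle$ be the sequence of actions selected by $\mathcal{E}_t$, and let $y^i_t$ be the payoff
received by $\mathcal{E}_t$ on round $i$ (i.e., $y^i_t = x^i_{t,a^i_t}$). By construction,

\[
y^i_t = E\left[ f_i(\tilde{S}_{t-1} \oplus \tilde{a}_t) - f_i(\tilde{S}_{t-1}) \,\middle|\, A_t, \tilde{S}_{t-1} \right] .
\]

Thus,

\[
\frac{r_t}{n} = \max_{a = (v,\tau) \in A} \left\{ \frac{1}{\tau} \left( f(\tilde{S}_{t-1} \oplus a) - f(\tilde{S}_{t-1}) \right) \right\} - E\left[ f(\tilde{S}_{t-1} \oplus \tilde{a}_t) - f(\tilde{S}_{t-1}) \,\middle|\, A_t, \tilde{S}_{t-1} \right] .
\]

Taking the expectation of both sides of the equations for $\epsilon_t$ and $r_t$ then shows that
$E[\epsilon_t] = E\left[\frac{r_t}{n}\right]$, as claimed.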

We now prove a bound on $E[R_{\mathrm{coverage}}]$. As already mentioned, $f$ satisfies Condition 1,
so the greedy schedule's approximation guarantees apply to $f$. In particular, by Theorem
6, we have $R_{\mathrm{coverage}} \le \sum_{t=1}^{L} r_t$. Thus $E[R_{\mathrm{coverage}}] \le E[R]$, where $R = \sum_{t=1}^{L} r_t$.

To bound $E[R_{\mathrm{coverage}}]$, it remains to justify the assumption that each meta-action $\tilde{a}_t$
always takes unit time. Regardless of what actions are chosen by each experts algorithm,
the schedule is defined for $L$ time steps in expectation. Thus if we set $L = T$, the schedules
$S_i$ returned by OG satisfy the budget in expectation, as required in the definition of
$R_{\mathrm{coverage}}$. Thus, as far as $R_{\mathrm{coverage}}$ is concerned, the meta-actions may as well take unit
time (in which case $\ell(S_i) = T$ with probability 1). Combining the bound on $E[R]$ stated
in Lemma 4 with the fact that $E[R_{\mathrm{coverage}}] \le E[R]$ yields the following theorem.

Theorem 9. Algorithm OG, run with input $L = T$, has $E[R_{\mathrm{coverage}}] \le E[R]$. If WMR is
used as the subroutine experts algorithm, then $E[R] = O\left(\sqrt{T n \ln |A|}\right)$.


The argument bounding $E[R_{\mathrm{cost}}]$ is similar, although somewhat more involved, and is
given in Appendix A. Relative to the case of unit-cost actions addressed in the previous
section, the additional complication here is that $\ell(S_i)$ is now a random variable, whereas
in the definition of $R_{\mathrm{cost}}$ the cost of a schedule is always calculated up to time $T$. This
complication can be overcome by making the probability that $\ell(S_i) < T$ sufficiently small,
which can be accomplished by setting $L \gg T$ and applying concentration inequalities.
However, $E[R]$ grows as a function of $L$, so we do not want to make $L$ too large. It turns
out that the (approximately) best bound is obtained by setting $L = T \ln n$.

Theorem 10. Algorithm OG, run with input $L = T \ln n$, has $E[R_{\mathrm{cost}}] = O\left(T \ln n \cdot E[R] + T \sqrt{n}\right)$.
In particular, $E[R_{\mathrm{cost}}] = O\left((\ln n)^{3/2} \, T \sqrt{T n \ln |A|}\right)$ if WMR is used as the
subroutine experts algorithm.

2.4.4  Dealing  with  limited  feedback 

Thus far we have assumed that, after specifying a schedule $S_i$, the online algorithm
receives complete access to the job $f_i$. We now consider three more limited feedback settings
that may arise in practice:

1. In the priced feedback model, to receive access to $f_i$ we must pay a price $C$. Each
time we do so, $C$ is added to the regret $R_{\mathrm{coverage}}$, and $TC$ is added to the regret $R_{\mathrm{cost}}$.

2. In the partially transparent feedback model, we only observe $f_i(S_{i\langle t \rangle})$ for each
$t > 0$.

3. In the opaque feedback model, we only observe $f_i(S_i)$.

The priced and partially transparent feedback models arise naturally in the case where
action $(v, \tau)$ represents running a deterministic algorithm $v$ for $\tau$ (additional) time units in
order to solve some decision problem. Assuming we halt once some $v$ returns an answer,
we obtain exactly the information that is revealed in the partially transparent model.
Alternatively, running each $v$ until it terminates would completely reveal the function $f_i$, but
incurs a computational cost.

Algorithm OG can be adapted to work in each of these three feedback settings. In all
cases, the high-level idea is to replace the unknown quantities used by OG with (unbiased)
estimates of those quantities. This technique has been used in a number of online
algorithms (e.g., see [5, 8, 17]).

Specifically, for each day $i$ and expert $j$, let $\hat{x}^i_j \in [0, 1]$ be an estimate of $x^i_j$ such that

\[
E\left[\hat{x}^i_j\right] = \gamma x^i_j + \delta^i
\]

for some constant $\delta^i$ (which is independent of $j$). In other words, we require that $\frac{1}{\gamma} (\hat{x}^i_j - \delta^i)$
is an unbiased estimate of $x^i_j$. Furthermore, let $\hat{x}^i_j$ be independent of the choices made by
the experts algorithm.

Let $\mathcal{E}$ be an experts algorithm, and let $\mathcal{E}'$ be the experts algorithm that results from
feeding back $\hat{x}^i_j$ to $\mathcal{E}$ (in place of $x^i_j$) as the payoff $\mathcal{E}$ would have received by selecting
expert $j$ on day $i$. The following lemma relates the performance of $\mathcal{E}'$ to that of $\mathcal{E}$.

Lemma 5. The worst-case expected regret that $\mathcal{E}'$ can incur over a sequence of $n$ days is
at most $\frac{R}{\gamma}$, where $R$ is the worst-case expected regret that $\mathcal{E}$ can incur over a sequence of
$n$ days.


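Proof. Let $\hat{x} = \langle \hat{x}^1, \hat{x}^2, \ldots, \hat{x}^n \rangle$ be the sequence of estimated payoffs. Because the
estimates $\hat{x}^i_j$ are independent of the choices made by $\mathcal{E}'$, we may imagine for the purposes of
analysis that $\hat{x}$ is fixed in advance. Fix some expert $j$. By definition of $R$,

\[
E\left[ \sum_{i=1}^{n} \hat{x}^i_{e_i} \right] \ge \sum_{i=1}^{n} \hat{x}^i_j - R ,
\]

where $e_i$ is the expert selected on day $i$. Taking the expectation of both sides with respect
to the choice of $\hat{x}$ then yields

\[
E\left[ \sum_{i=1}^{n} \left( \gamma x^i_{e_i} + \delta^i \right) \right] \ge \sum_{i=1}^{n} \left( \gamma x^i_j + \delta^i \right) - R ,
\]

or equivalently $E\left[ \sum_{i=1}^{n} x^i_{e_i} \right] \ge \sum_{i=1}^{n} x^i_j - \frac{R}{\gamma}$.
Because $j$ was arbitrary, it follows that $\mathcal{E}'$ has worst-case expected regret $\frac{R}{\gamma}$.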


□ 


The  priced  feedback  model 

In the priced feedback model, we use a technique similar to that of [17]. With probability
$\gamma$, we will pay cost $C$ in order to reveal $f_i$, and then feed the usual payoffs back to each
experts algorithm $\mathcal{E}_t$. Otherwise, with probability $1 - \gamma$, we feed back zero payoffs to
each $\mathcal{E}_t$ (note that without paying cost $C$, we receive no information whatsoever about $f_i$,
and thus we have no basis for assigning different payoffs to different actions). We refer to
this algorithm as $\mathrm{OG}^{\mathrm{p}}$. By Lemma 5, $E[r_t]$ is bounded by $\frac{1}{\gamma}$ times the worst-case regret
of $\mathcal{E}_t$. By bounding $E[R_{\mathrm{coverage}}]$ and $E[R_{\mathrm{cost}}]$ as a function of $\gamma$ and then optimizing $\gamma$
to minimize the bounds, we obtain the following theorem, a complete proof of which is
given in Appendix A.

Theorem 11. Algorithm $\mathrm{OG}^{\mathrm{p}}$, run with WMR as the subroutine experts algorithm, has
$E[R_{\mathrm{coverage}}] = O\left((C \ln |A|)^{1/3} (T n)^{2/3}\right)$ (when run with input $L = T$) and has
$E[R_{\mathrm{cost}}] = O\left((T \ln n)^{5/3} (C \ln |A|)^{1/3} (n)^{2/3}\right)$ (when run with input $L = T \ln n$) in the priced feedback
model.
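In the notation of Lemma 5, the feedback rule of the priced algorithm described above can be written out explicitly. The display below is a short check of the estimate's bias under that description, not a substitute for the proof in Appendix A:

```latex
\hat{x}^i_{t,a} =
\begin{cases}
  x^i_{t,a} & \text{with probability } \gamma \text{ (price } C \text{ paid, } f_i \text{ revealed)} \\
  0         & \text{with probability } 1 - \gamma
\end{cases}
\qquad \Longrightarrow \qquad
E\left[\hat{x}^i_{t,a}\right] = \gamma \, x^i_{t,a} + 0 .
```

So the condition of Lemma 5 holds with $\delta^i = 0$. Since the price is paid on $\gamma n$ rounds in expectation, exploration contributes $\gamma n C$ to $E[R_{\mathrm{coverage}}]$ (and $\gamma n T C$ to $E[R_{\mathrm{cost}}]$); this is the term that is traded off against the $\frac{1}{\gamma}$ factor when $\gamma$ is optimized.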
The partially transparent feedback model

In  the  partially  transparent  feedback  model,  each  £t  will  run  a  copy  of  the  Exp3  algorithm 
[5],  which  is  a  randomized  experts  algorithm  that  only  requires  as  feedback  the  payoff 
of  the  expert  it  actually  selects.  In  the  partially  transparent  feedback  model,  if  £t  selects 
action  a\  =  (v,  r)  on  round  i,  it  will  receive  feedback  fa  1  ©  a\)  —  fa  i)  if  a\  is 
appended  to  the  schedule  (with  probability  ^),  and  will  receive  zero  payoff  otherwise.  Ob¬ 
serve  that  the  information  necessary  to  compute  these  payoffs  is  revealed  in  the  partially 
transparent  feedback  model.  Furthermore,  the  expected  payoff  that  Et  receives  if  it  selects 
action  a  is  x\  a,  and  the  payoff  that  £t  receives  from  choosing  action  a  on  round  i  is  inde¬ 
pendent  from  the  choices  made  by  £t  on  previous  rounds.  Thus,  by  Lemma  5,  the  worst- 
case  expected  regret  bounds  of  Exp3  can  be  applied  to  the  true  payoffs  x\a.  The  worst- 

case  expected  regret  of  Exp3  is  O  ^n\A\ln\A\y  so  E  [R]  =  O  (Ly/n\A\ki\A\). 
This  bound,  combined  with  Theorems  9  and  10,  establishes  the  following  theorem. 

Theorem 12. Algorithm OG, run with Exp3 as the subroutine experts algorithm, has
$E[R_{\mathrm{coverage}}] = O\left(T\sqrt{n|\mathcal{A}| \ln|\mathcal{A}|}\right)$ (when run with input $L = T$) and has
$E[R_{\mathrm{cost}}] = O\left((T \ln n)^2 \sqrt{n|\mathcal{A}| \ln|\mathcal{A}|}\right)$ (when run with input $L = T \ln n$) in the partially transparent
feedback model.


The  opaque  feedback  model 


In the opaque feedback model, our algorithm and its analysis are similar to those of $\mathrm{OG}^{\mathrm{p}}$.
With probability $1 - \gamma$, we feed back zero payoffs to each $\mathcal{E}_t$. Otherwise, with probability
$\gamma$, we explore as follows. Pick $t$ uniformly at random from $\{1, 2, \ldots, L\}$, and pick an
action $a = (v, \tau)$ uniformly at random from $\mathcal{A}$. Select the schedule $S_i = S_{i,\langle t-1\rangle} \oplus a$.
Observe $f_i(S_i)$, and feed $\frac{1}{\tau}$ times this value back to $\mathcal{E}_t$ as the payoff associated with action
$a$. Finally, feed back zero for all other payoffs.


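We refer to this algorithm as $\mathrm{OG}^{\mathrm{o}}$. The key to its analysis is the following observation.
Letting $\hat{x}^t_{i,a}$ denote the payoff to experts algorithm $\mathcal{E}_t$ for choosing action $a = (v, \tau)$ on
round $i$, we have

$$E\left[\hat{x}^t_{i,a}\right] = \gamma \cdot \frac{1}{L} \cdot \frac{1}{|\mathcal{A}|} \cdot \frac{1}{\tau} f_i(S_{i,\langle t-1\rangle} \oplus a) = \frac{\gamma}{L|\mathcal{A}|} x^t_{i,a} + \delta^t_i$$

where $x^t_{i,a} = \frac{1}{\tau}\left(f_i(S_{i,\langle t-1\rangle} \oplus a) - f_i(S_{i,\langle t-1\rangle})\right)$ and $\delta^t_i = \frac{\gamma}{L|\mathcal{A}|\tau} f_i(S_{i,\langle t-1\rangle})$. Thus, $\hat{x}^t_{i,a}$ is a biased
estimate of the correct payoff, and Lemma 5 implies that $E[r_t]$ is at most $\frac{L|\mathcal{A}|}{\gamma}$ times the
worst-case expected regret of $\mathcal{E}_t$.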
The performance of $\mathrm{OG}^{\mathrm{o}}$ is summarized in the following theorem, which we prove in
Appendix A.

Theorem 13. Algorithm $\mathrm{OG}^{\mathrm{o}}$, run with WMR as the subroutine experts algorithm, has
$E[R_{\mathrm{coverage}}] = O\left(T (|\mathcal{A}| \ln|\mathcal{A}|)^{1/3} n^{2/3}\right)$ (when run with input $L = T$) and has
$E[R_{\mathrm{cost}}] = O\left((T \ln n)^2 (|\mathcal{A}| \ln|\mathcal{A}|)^{1/3} n^{2/3}\right)$ (when run with input $L = T \ln n$) in the opaque feedback
model.


2.4.5  Lower  bounds  on  regret 

In Appendix A we prove the following lower bounds on regret. The lower bounds apply to
the online versions of two set-covering problems: Max k-Coverage and Min-Sum Set
Cover. The offline versions of these two problems were defined in §2.1.4. The online
versions are special cases of the online versions of Budgeted Maximum Submodular
Coverage and Min-Sum Submodular Cover, respectively. For a formal description
of the online set covering problems, see the text leading up to the proofs of Theorems
14 and 15 in Appendix A.

It is worth pointing out that the lower bounds hold even in a distributional online
setting in which the jobs $f_1, f_2, \ldots, f_n$ are drawn independently at random from a fixed
distribution.


Theorem 14. Any algorithm for online Max k-Coverage has worst-case expected
1-regret $\Omega\left(\sqrt{Tn \ln\frac{|V|}{T}}\right)$, where $V$ is the collection of sets and $T = k$ is the number of sets
selected by the online algorithm on each round.

Theorem 15. Any algorithm for online Min-Sum Set Cover has worst-case expected
1-regret $\Omega\left(T\sqrt{Tn \ln\frac{|V|}{T}}\right)$, where $V$ is a collection of sets and $T$ is the number of sets
selected by the online algorithm on each round.


In  Appendix  A  we  show  that  there  exist  exponential-time  online  algorithms  for  these 
online  set  covering  problems  whose  regret  matches  the  lower  bounds  in  Theorem  14  (resp. 
Theorem  15)  up  to  constant  (resp.  logarithmic)  factors. 

Note that the upper bounds in Theorem 8 match the lower bounds in Theorems 14 and
15 up to logarithmic factors, although the former apply to $(1 - \frac{1}{e})$-regret and 4-regret rather
than 1-regret.
2.4.6 Refining the online greedy algorithm

We now discuss two simple modifications to OG that do not improve its worst-case
guarantees, but that often improve its performance in practice (we make use of both of these
modifications in our experiments in Chapter 3).


Avoiding  duplicate  actions 

In many practical applications, it is never worthwhile to perform the same action twice.
As an example, suppose that an action $a = (v, \tau)$ represents performing a run of length $\tau$
of a deterministic algorithm $v$ (and then removing the run from memory), and $f(S) = 1$
if performing the actions in $S$ yields a solution to a problem instance, and $f(S) = 0$
otherwise. Clearly, performing $a$ twice can never increase the value of $f$. In cases such as
this, the online algorithm OG as currently defined may never “figure out” that it should
avoid performing the same action twice, as the following example illustrates.

Example 2. Let $\mathcal{A} = \{a_1, a_2, \ldots, a_T\}$ be a set of $T$ actions that each take unit time,
and for all $i$, let $f_i(S)$ equal $\frac{1}{T}$ times the number of distinct actions that appear in $S$.
Thus, the schedule $S^* = \langle a_1, a_2, \ldots, a_T\rangle$ has $f_i(S^*) = 1$ for all $i$, and is optimal in
terms of coverage. Suppose we run OG on the sequence of jobs $\langle f_1, f_2, \ldots, f_n\rangle$. All
actions yield equal payoff to $\mathcal{E}_1$. If $\mathcal{E}_1$ is a standard experts algorithm such as randomized
weighted majority, it will choose actions uniformly at random. Given that $\mathcal{E}_1$ chooses
actions uniformly at random, $\mathcal{E}_2$ will (asymptotically) choose actions uniformly at random
as well. Inductively, all actions will be chosen uniformly at random. If so, the probability that any
particular action is selected by at least one of the $T$ experts algorithms is $1 - (1 - \frac{1}{T})^T$ (which approaches
$1 - \frac{1}{e}$ as $T \to \infty$). By linearity of expectation, the expected fraction of the $T$ actions that
appear in the schedule is exactly this quantity.
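
The quantity in Example 2 is easy to check numerically. The sketch below is illustrative only: it assumes, as in the example, that $T$ selections are made independently and uniformly at random from $T$ actions, and the function names are chosen for this example.

```python
import math
import random

# Limiting value of the expected fraction as T grows.
LIMIT = 1.0 - 1.0 / math.e

def expected_distinct_fraction(T):
    """Expected fraction of the T actions chosen at least once when T
    selections are made independently and uniformly at random."""
    return 1.0 - (1.0 - 1.0 / T) ** T

def simulated_distinct_fraction(T, trials=2000, seed=0):
    """Monte Carlo estimate of the same quantity."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        # Count the distinct actions among T uniform selections.
        total += len({rng.randrange(T) for _ in range(T)})
    return total / (trials * T)
```

For small $T$ the fraction is noticeably above the limit (for example, it is 0.75 when $T = 2$), and it decreases toward `LIMIT` as $T$ grows.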

To improve performance on examples such as this one, we may force the online
algorithm to return a schedule with no duplicate actions as follows. Just before job $f_i$ arrives,
obtain from each experts algorithm $\mathcal{E}_t$ a distribution over $\mathcal{A}$ (for experts algorithms such
as randomized weighted majority, it is straightforward to obtain this distribution
explicitly). We then sample from these distributions as follows. We first sample from $\mathcal{E}_1$ to
obtain an action $a^1_i$. To obtain action $a^t_i$ for $t > 1$, we repeatedly sample from the
distribution returned by $\mathcal{E}_t$ until we obtain an action not in the set $\{a^1_i, a^2_i, \ldots, a^{t-1}_i\}$ (given the
distribution, we can simulate this step without actually performing repeated sampling).
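
One way to simulate the repeated sampling step is to condition each distribution on the actions not yet chosen. The sketch below is a minimal illustration under the assumption that each experts algorithm exposes its distribution as a dictionary; the function name and data layout are hypothetical.

```python
import random

def sample_without_duplicates(distributions, rng=random):
    """distributions[t] maps each action to the probability that the
    (t+1)-th experts algorithm assigns to it. Returns one action per
    experts algorithm, with no action repeated."""
    chosen = []
    used = set()
    for dist in distributions:
        # Restricting to unused actions and renormalizing is equivalent to
        # sampling repeatedly until an unused action appears.
        remaining = {a: p for a, p in dist.items() if a not in used and p > 0}
        if not remaining:
            raise ValueError("no unused action has positive probability")
        actions = list(remaining)
        weights = [remaining[a] for a in actions]
        action = rng.choices(actions, weights=weights, k=1)[0]
        chosen.append(action)
        used.add(action)
    return chosen
```

With $T$ uniform distributions over $T$ actions, as in Example 2, this always returns a permutation of the actions.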

With this modification, OG always achieves coverage 1 for the job $f$ described in
Example 2. Furthermore, this modification preserves the worst-case guarantees of the
original version of OG (under the assumption that performing the same action twice never
increases the value of any function $f_i$). Informally, this follows from the fact that the
expected payoff received by sampling from the modified distribution can never be smaller
than the expected payoff received by sampling from the original distribution (because the
payoffs associated with the experts corresponding to actions already in the schedule are
all zero). For this reason, this modification never increases the worst-case regret of the
experts algorithms, and our previous analysis carries through unchanged.


Independent  versus  dependent  probabilities 

Recall that in the case of arbitrary-cost actions, when an experts algorithm selects an action
$(v, \tau)$ we add this action to the schedule independently with probability $\frac{1}{\tau}$. The fact that
this addition is performed independently of the actions that are already in the schedule can
lead to undesirable behavior, as the following example illustrates.

Example 3. Let $V = \{v\}$ consist of a single activity, let $f(S) = 1$ if $S$ contains the action
$(v, T)$, and let $f(S) = 0$ otherwise. Thus, the schedule $S^* = \langle (v, T)\rangle$ maximizes $f(S_{\langle T\rangle})$.
However, $E[f(S)] \le 1 - (1 - \frac{1}{T})^T$ if $S$ is a schedule returned by OG. This is true
because at most $T$ experts algorithms can select the action $(v, T)$, but in each case the
action is only added to the schedule with probability $\frac{1}{T}$, so the probability that $(v, T)$ is
added to the schedule is at most $1 - (1 - \frac{1}{T})^T$, which approaches $1 - \frac{1}{e}$ as $T \to \infty$.

We can fix this problem as follows. When experts algorithm $\mathcal{E}_t$ selects an action $a_t =
(v, \tau)$, we increase the probability that the action is in the schedule by $\frac{1}{\tau}$. In other words,
if $a_t$ has been picked by $k$ experts algorithms so far but has still not been added to the
schedule, then we add it to the schedule with probability $\frac{1}{\tau - k}$. Thus, if $\tau$ consecutive
experts algorithms select the same action $(v, \tau)$, it will always be added to the schedule
exactly once.
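
The dependent rule can be checked with exact arithmetic. The sketch below assumes an integer duration `tau` and assumes that each selection raises the probability that the action is in the schedule by 1/tau, which corresponds to adding the action with conditional probability 1/(tau - k) after k earlier selections that did not add it. The function names are chosen for this illustration.

```python
from fractions import Fraction

def add_probability(k, tau):
    """Probability of adding (v, tau) on the current selection, given that
    k earlier selections of it were made and none of them added it."""
    if not 0 <= k < tau:
        raise ValueError("need 0 <= k < tau")
    return Fraction(1, tau - k)

def cumulative_probability(picks, tau):
    """Probability that the action is in the schedule after `picks`
    selections under the dependent rule."""
    not_added = Fraction(1)
    for k in range(picks):
        not_added *= 1 - add_probability(k, tau)
    return 1 - not_added
```

Under these assumptions the cumulative probability after j selections is exactly j/tau, so it reaches 1 after tau selections, whereas independent additions with probability 1/tau reach only 1 - (1 - 1/tau)^tau.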

The schedules produced by this modified online algorithm still consume $T$ time steps
in expectation, and our previous analysis carries through to give the same regret bounds on
$R_{\mathrm{coverage}}$ that were stated in Theorem 9. Unfortunately, the analysis for the bounds on $R_{\mathrm{cost}}$
stated in Theorem 10 depends critically on the use of independent probabilities, and does
not carry through after having made this modification. Nevertheless, in our experiments in
Chapter 3 we found that this modification was helpful in practice.


2.5  Open  Problems 

The  results  presented  in  this  chapter  suggest  several  open  problems: 
1. Avoiding discretization. As currently defined, our online algorithm can only handle a
finite set of actions $\mathcal{A}$. Thus, to apply this online algorithm to a problem in which the
actions have real-valued durations between 0 and 1, one might discretize the
durations to be in the set $\{\frac{1}{T}, \frac{2}{T}, \ldots, 1\}$. To achieve the best performance, one would like
to set $T$ as large as possible, but the time and space required by the online algorithm
grow linearly with $T$. It would be desirable to avoid discretization altogether,
perhaps after making additional smoothness assumptions about the jobs $f_i$. A possible
approach would be to consider the limiting behavior of our algorithm as $T \to \infty$,
for some particular choice of subroutine experts algorithm.

2. Lower bounds on 4-regret and $(1 - \frac{1}{e})$-regret. The lower bounds proved in §2.4.5
apply only to 1-regret, whereas our online algorithms optimize either 4-regret (in
the case of $R_{\mathrm{cost}}$) or $(1 - \frac{1}{e})$-regret (in the case of $R_{\mathrm{coverage}}$). It would be interesting
to prove lower bounds on $R_{\mathrm{cost}}$ and $R_{\mathrm{coverage}}$. Such lower bounds would hold for
online algorithms that make decisions in polynomial time, under the assumption that
P $\ne$ NP.

3. An online version of the refined greedy approximation algorithm G'. Recall that in
§2.3.2 we showed that the offline greedy approximation algorithm is sub-optimal
for a simple job involving two activities, and then considered an alternative greedy
approximation algorithm that produces an optimal schedule for this job. The online
algorithm presented in §2.4 is based on the original greedy approximation algorithm,
and thus it also performs sub-optimally on this simple example. Although it appears
non-trivial to do so, it would be interesting to develop an online version of the
alternative greedy approximation algorithm that performed optimally on such examples.

2.6  Conclusions 

This chapter considered an online resource allocation problem that generalizes several
previously-studied online problems, and that has applications to algorithm portfolio
design and the optimization of query processing in databases. The main contribution of this
chapter was an online version of a greedy approximation algorithm whose worst-case
performance guarantees in the offline setting are the best possible assuming P $\ne$ NP. In
the next chapter we evaluate the online greedy algorithm experimentally by using it to
combine multiple problem-solving heuristics in an online setting.
Chapter 3

Combining  Multiple  Heuristics  Online 


3.1  Introduction 

In this chapter we present black-box techniques that can be used to combine multiple
problem-solving heuristics into a new heuristic with (potentially) improved average-case
running time. In our model, a user is given a set of heuristics whose only observable
behavior is their running time. Each heuristic can compute a solution to any problem instance,
but its running time varies across instances. The user solves each instance by interleaving
runs of the heuristics according to some schedule. If the heuristics are randomized, the
user may also periodically restart them with a fresh random seed.

Building  on  the  results  of  chapter  2,  we  present 


1. exact and approximation algorithms for computing an optimal schedule offline,

2.  sample  complexity  bounds  for  learning  a  schedule  from  training  data,  and 

3.  no-regret  algorithms  for  learning  a  schedule  on-the-fly  while  solving  a  sequence  of 
problems. 


In  our  experimental  evaluation,  we  use  data  from  recent  solver  competitions  to  show 
that  these  algorithms  can  be  used  to  create  improved  versions  of  state-of-the-art  solvers  in 
a  number  of  problem  domains. 
3.1.1 Motivations


Many important computational problems seem unlikely to admit algorithms with
provably good worst-case performance, yet must be solved as a matter of practical necessity.
Examples of such problems include Boolean satisfiability, A.I. planning, integer
programming, and numerous scheduling and resource allocation problems. In each of these
problem domains, heuristics¹ have been developed that perform much better in practice than a
worst-case analysis would guarantee, and there is an active research community working
to develop improved heuristics. Indeed, entire conferences are devoted to the study of
particular problem domains (e.g., Boolean satisfiability, A.I. planning), and annual
competitions are held in order to assess the state of the art and to promote the development of
better heuristics.

A major drawback of using heuristics is that the behavior of a heuristic on a
previously unseen problem instance can be difficult to predict in advance. The running time of
a heuristic may vary by orders of magnitude across seemingly similar problem instances
or, if the heuristic is randomized, across multiple runs on a single instance that use
different random seeds [33, 38]. For this reason, after running a heuristic unsuccessfully for
some time one might decide to suspend the execution of that heuristic and start running a
different heuristic (or the same heuristic with a different random seed).

Previous work has shown that combining multiple heuristics into a portfolio can
dramatically improve average-case running time [33, 38]. Table 3.1 illustrates a situation in
which this is the case. The table shows the behavior of the top two solvers from the
industrial track of the 2007 SAT competition on three of the competition benchmarks. On
these three instances, simply running the two solvers in parallel (e.g., at equal strength on
a single processor) would reduce the average-case running time by orders of magnitude.


Table 3.1: Behavior of two solvers on instances from the 2007 SAT competition.

Instance                                             Rsat CPU (s)    picosat CPU (s)
industrial/anbulagan/medium-sat/dated-10-13-s.cnf    45              28
industrial/babic/dspam/dspam_dump_vc1081.cnf         3               > 10000
industrial/grieu/vmpc_31.cnf                         > 10000         238

¹ A heuristic is simply an algorithm. We use the term “heuristic” only to suggest worst-case exponential
running time.
If the heuristics are randomized, additional performance improvements may be achieved
by periodically restarting them with a fresh random seed. In particular, solvers based on
chronological backtracking often exhibit heavy-tailed run length distributions, and restarts
can yield order-of-magnitude improvements in performance [34, 35].


[Figure 3.1 plots not reproduced. Panel titles: “satz-rand running on logistics.d (length 14)” and
“satz-rand running on logistics.d (length 13)”.]

Figure 3.1: Run length distribution of satz-rand on two formulae created by SATPLAN
in solving the logistics planning instance logistics.d. Each curve was estimated
using 150 independent runs, and run lengths were capped at 1000 seconds.

Figure 3.1 shows the run length distribution of the SAT solver satz-rand on two
Boolean formulae created by running a state-of-the-art planning algorithm, SATPLAN,
on a logistics planning benchmark. To find a provably minimum-length plan, SATPLAN
creates a sequence of Boolean formulae $\langle \sigma_1, \sigma_2, \ldots\rangle$, where $\sigma_i$ is satisfiable if and only
if there exists a feasible plan of length $\le i$. In this case the minimum plan length is
14. When run on the (satisfiable) formula $\sigma_{14}$, satz-rand exhibits a heavy-tailed run
length distribution. There is about a 20% chance of solving the problem after running
for 2 seconds, but also a 20% chance that a run will not terminate after having run for
1000 seconds. By restarting the solver every 2 seconds until it yields a solution, one can
reduce the expected time required to find a solution by more than an order of magnitude.
In contrast, when satz-rand is run on the (unsatisfiable) instance $\sigma_{13}$, over 99% of the
runs take at least 19 seconds, so the same restart policy would be ineffective. Restarts are
still beneficial on this instance, however; restarting every 45 seconds reduces the mean
run length by at least a factor of 1.5. Of course, when using a randomized SAT solver
to solve a given formula for the first time one does not know its run length distribution
on that formula, and thus one must select a restart schedule based on experience with
previously-solved formulae.
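
The effect of a fixed restart cutoff can be estimated from sample run lengths. The sketch below is illustrative: it assumes independent runs, uses the standard identity that the expected time with cutoff c is E[min(T, c)] / P[T <= c], and the sample data in the usage note are made up to mimic the percentages quoted above rather than taken from the experiments.

```python
def expected_time_with_restarts(run_lengths, cutoff):
    """Estimate the expected time to a solution when the solver is
    restarted every `cutoff` seconds, from independent sample run lengths."""
    successes = sum(1 for t in run_lengths if t <= cutoff)
    if successes == 0:
        # No sampled run finishes within the cutoff.
        return float("inf")
    # Each attempt costs min(t, cutoff); attempts repeat until one succeeds.
    mean_attempt = sum(min(t, cutoff) for t in run_lengths) / len(run_lengths)
    success_prob = successes / len(run_lengths)
    return mean_attempt / success_prob
```

For a hypothetical sample such as `[2] * 2 + [100] * 6 + [1000] * 2`, a cutoff of 2 seconds gives an estimate of about 10 seconds, while the mean run length without restarts is at least 260 seconds, since the longest runs are capped.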


3.1.2  Formal  setup 

In our model, we are given a set $\mathcal{H}$ of heuristics, with $|\mathcal{H}| = k$, and a set $X$ of instances
of some decision problem. Heuristic $h$, when run on instance $x$, runs for $T(h, x)$ time
units before returning a (provably correct) “yes” or “no” answer. In general, the heuristic
$h$ will be randomized, and $T(h, x)$ will be a random variable whose outcome depends on
the sequence of random bits supplied as input to $h$.

As in Chapter 2, we consider schedules that are sequences of actions of the form
$(h, \tau) \in \mathcal{H} \times \mathbb{R}_{>0}$. We use $\mathcal{S}$ to denote the set of schedules. In our setting, an action
$(h, \tau)$ represents running heuristic $h$ for (additional) time $\tau$. We require that each heuristic
$h \in \mathcal{H}$ be executed in one of two models: the suspend-and-resume model or the restart
model (the choice of model need not be the same for all $h \in \mathcal{H}$).

• If $h$ is executed in the suspend-and-resume model, then an action $(h, \tau)$ represents
continuing a run of heuristic $h$ for an additional $\tau$ time units. When the action is
completed, the run of $h$ is temporarily suspended and kept resident in memory, to be
potentially resumed by a later action.

• If $h$ is executed in the restart model, then an action $(h, \tau)$ represents running $h$ from
scratch for time $\tau$, and then deleting the run from memory. If $h$ is randomized, the
run is performed with a fresh random seed.

As an example, suppose that $h_1$ is executed in the suspend-and-resume model and $h_2$
is executed in the restart model. Then the schedule

$$S = \langle (h_1, 5), (h_2, 5), (h_1, 10), (h_2, 10) \rangle$$

is interpreted as follows: “run $h_1$ for 5 time units; then suspend the run of $h_1$ and run $h_2$
for 5 time units; then discard the run of $h_2$ and continue the run of $h_1$ for an additional 10
time units; then suspend the run of $h_1$ and run $h_2$ from scratch (with a fresh random seed)
for 10 time units.”
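
The interpretation of a schedule under the two execution models can be sketched as a small routine that tracks how much time each suspended run has accumulated. The function name `describe` and the model labels `"resume"` and `"restart"` are assumptions made for this illustration.

```python
def describe(schedule, model):
    """Translate a schedule into a list of human-readable steps.

    schedule: list of (heuristic, duration) actions.
    model: dict mapping each heuristic to "resume" (suspend-and-resume)
        or "restart"; these labels are chosen for this sketch.
    """
    elapsed = {}  # time already invested in each suspended run
    steps = []
    for h, tau in schedule:
        if model[h] == "resume":
            start = elapsed.get(h, 0)
            elapsed[h] = start + tau
            steps.append(f"{h}: continue run from {start} to {start + tau}")
        else:
            # Restart model: nothing is kept between actions.
            steps.append(f"{h}: fresh run for {tau}, then discard")
    return steps
```

Applied to the example schedule, the second action for the suspend-and-resume heuristic picks up at time 5 and runs to time 15, while each action for the restart heuristic starts from scratch.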

This class of schedules includes restart schedules [61] and task-switching schedules
[73] as special cases.

• A restart schedule is a schedule for a set $\mathcal{H}$ that contains a single heuristic, executed
in the restart model. A restart schedule can be written more concisely as a sequence
$S = \langle \tau_1, \tau_2, \ldots\rangle$ of positive integers, whose meaning is “run $h$ for $\tau_1$ time units; if
this does not yield a solution then restart $h$ with a fresh sequence of random bits and
run it for $\tau_2$ time units, …”. When executing a restart schedule, only a single run
needs to be kept in memory.

• A task-switching schedule is a schedule that runs all heuristics in the
suspend-and-resume model. If all heuristics in $\mathcal{H}$ are deterministic, then the optimal schedule
must be a task-switching schedule (assuming there is no overhead associated with
keeping multiple runs in memory). When executing a task-switching schedule, up
to $k$ runs need to be kept in memory simultaneously.

We use $\mathcal{S}_{rs}$ and $\mathcal{S}_{ts}$ to denote the set of restart schedules and the set of task-switching
schedules, respectively.

We measure the performance of a schedule $S$ on a problem instance $x$ in terms of the
expected time required to solve $x$ using $S$ (where the expectation is over the random bits
used in the runs that $S$ performs). For any schedule $S$, let $p_x(S)$ denote the probability
that performing the sequence of actions in $S$ yields a solution to $x$. For example, if $S =
\langle (h_1, \tau_1), (h_2, \tau_2), \ldots, (h_L, \tau_L)\rangle$, and all heuristics are executed in the restart model, then

$$p_x(S) = 1 - \prod_{l=1}^{L} P\left[T(h_l, x) > \tau_l\right].$$

In §3.3.1 we formally define $p_x(S)$ in the general case, when one or more heuristics may
be executed in the suspend-and-resume model.
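
The restart-model formula for the success probability translates directly into code. In this sketch the `survival` callback, which returns the probability that a single run of a heuristic is still unfinished after a given time, is a hypothetical interface introduced for illustration.

```python
def solve_probability_restart(schedule, survival):
    """Success probability of a schedule when every heuristic is executed
    in the restart model.

    schedule: list of (heuristic, tau) actions.
    survival: hypothetical callback; survival(h, tau) is the probability
        that a fresh run of h does not finish within time tau.
    """
    all_fail = 1.0
    for h, tau in schedule:
        # Runs use fresh random seeds, so their failures are independent.
        all_fail *= survival(h, tau)
    return 1.0 - all_fail
```

For instance, if every action fails with probability 0.5, a schedule with three actions succeeds with probability 0.875.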
Overloading notation, let $T(S, x)$ denote the time required to solve instance $x$ using
schedule $S$. For any non-negative random variable $X$, we have $E[X] = \int_{t=0}^{\infty} P[X > t]\, dt$.
Thus the expected time required to solve $x$ using $S$ can be written as

$$E[T(S, x)] = \int_{t=0}^{\infty} 1 - p_x(S_{\langle t\rangle})\, dt = c(p_x, S) \tag{3.1}$$

where, as in Chapter 2, $S_{\langle t\rangle}$ is the schedule that results from truncating schedule $S$ at time $t$.
For example, if $S = \langle (h_1, 3), (h_2, 3)\rangle$ then $S_{\langle 5\rangle} = \langle (h_1, 3), (h_2, 2)\rangle$ (for a formal definition
of $S_{\langle t\rangle}$, see §2.1.1).
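
Truncating a schedule at a given time can be sketched as follows. This is an illustration that matches the example in the text; the formal definition in §2.1.1 may treat edge cases differently, and the function name is an assumption.

```python
def truncate(schedule, t):
    """Return the prefix of `schedule` that fits within time t, shortening
    the last action if it would run past t."""
    result = []
    remaining = t
    for h, tau in schedule:
        if remaining <= 0:
            break
        run = min(tau, remaining)
        result.append((h, run))
        remaining -= run
    return result
```

With the schedule from the example, `truncate([("h1", 3), ("h2", 3)], 5)` keeps the first action whole and cuts the second down to 2 time units.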

In the special case when all heuristics are executed in the restart model, it can be
shown that $p_x(S)$ satisfies the conditions required of a job, as defined in Chapter 2.
However, when one or more heuristics are executed in the suspend-and-resume model, $p_x$ is
no longer submodular. Nevertheless, we show in §3.1.4 that the sufficient conditions
described in Chapter 2 are satisfied even when some heuristics are executed in the
suspend-and-resume model, so the results from Chapter 2 still apply.

By definition, the right hand side of (3.1) equals the cost $c(p_x, S)$ of the schedule $S$
for the job $p_x$. This implies that the offline and online greedy approximation algorithms
presented in Chapter 2 can be used to select schedules of the form considered in this
chapter, so as to minimize $E[T(S, x)]$.

As in Chapter 2, we use $\ell(S)$ to denote the sum of the durations of the actions in $S$.
For example, if $S = \langle (v_1, 3), (v_2, 3)\rangle$, then $\ell(S) = 6$.

3.1.3  Summary  of  results 

We first consider the schedule selection problem in an offline setting. In the offline setting
we are given a set of instances $X$, and are given as input the distribution of $T(h, x)$ for
all $h \in \mathcal{H}$ and $x \in X$. Our goal is to compute the schedule with minimum total expected
running time over the instances in $X$, namely $S^* = \arg\min_{S \in \mathcal{S}} \sum_{x \in X} E[T(S, x)]$. In
this setting, the greedy algorithm for Min-Sum Submodular Cover from Chapter 2
gives a 4-approximation to the optimal schedule. We also show that, even in the special
case where all heuristics are deterministic, this offline problem generalizes Min-Sum Set
Cover [26], implying that for any $\epsilon > 0$, computing a $(4 - \epsilon)$-approximation is NP-hard.
We also give exact and approximation algorithms based on shortest path computations,
whose running time is exponential as a function of $k$ (where $k = |\mathcal{H}|$) but is polynomial
for any fixed $k$.

We next consider a learning-theoretic setting in which we draw training instances
independently at random from a distribution, compute an optimal schedule for the training
instances, and then use that schedule to solve additional test instances drawn from the
same distribution. In this setting, we give bounds on the number of instances required to
learn a schedule that is probably approximately correct [86].

We then consider an online setting in which we are fed a sequence $X = \langle x_1, x_2, \ldots, x_n\rangle$
of problem instances one at a time and must obtain a solution to each instance (via some
schedule) before moving on to the next instance. When selecting a schedule $S_i$ to use to
solve instance $x_i$, we have knowledge of the previous instances $x_1, x_2, \ldots, x_{i-1}$ but we
have no knowledge of $x_i$ itself or of any subsequent instances. In this setting, the online
greedy algorithm for Min-Sum Submodular Cover (from Chapter 2) converges to a
4-approximation to the best schedule, and requires decision-making time polynomial in
$k$. We also present online shortest paths algorithms that converge to an $\alpha$-approximation
to the best schedule (for some desired $\alpha > 1$), but which require decision-making time
exponential in $k$.

We  then  discuss  how  our  results  in  these  three  settings  can  be  extended  in  two  ways. 
First,  we  show  that  our  algorithms  can  be  applied  in  an  interesting  way  to  heuristics  for 
optimization  rather  than  decision  problems.  Second,  we  discuss  how  quickly-computable 
features  of  problem  instances  can  be  used  to  improve  the  schedule  selection  process. 

Experimentally,  we  use  data  from  recent  solver  competitions  to  show  that  task-switching 
schedules  computed  by  our  greedy  approximation  algorithm  can  be  used  to  improve  the 
performance  of  state-of-the-art  solvers  in  several  problem  domains.  Our  experimental 
evaluation  considers  both  optimization  and  decision  problems,  and  makes  use  of  instance 
features to improve the schedule selection process. We also show that the greedy
approximation  algorithm  can  be  used  to  construct  a  restart  schedule  that  improves  the 
performance  of  a  randomized  SAT  solver  on  a  set  of  logistics  planning  benchmarks. 

The  results  in  this  chapter  are  based  in  part  on  two  conference  papers  [78,  79]. 


3.1.4  Relationship  to  the  framework  of  Chapter  2 

Before  moving  on,  we  show  how  the  problem  considered  in  this  chapter  can  be  put  into 
the  framework  of  Chapter  2.  As  already  mentioned,  this  fact  allows  us  to  apply  the  offline 
greedy  approximation  algorithm  from  Chapter  2,  as  well  as  its  online  counterpart,  to  the 
problem  considered  in  this  chapter. 

As mentioned in §3.1.2, given an instance $x$, we will be interested in selecting a
schedule $S$ so as to minimize the expected running time $c(p_x, S) = \int_{t=0}^{\infty} 1 - p_x(S_{\langle t\rangle})\, dt =
E[T(S, x)]$. In the online setting, a sequence $\langle x_1, x_2, \ldots, x_n\rangle$ of instances arrives online.
The following lemma shows that the sequence $\langle p_{x_1}, p_{x_2}, \ldots, p_{x_n}\rangle$ satisfies Condition 2
(defined in §2.1.2). As we discuss further in §3.6.3, this implies that the online greedy
algorithm from Chapter 2 can be used to select schedules in such a way that the average
running time of the schedules selected by the online algorithm is (asymptotically) at most
4 times that of the optimal fixed schedule for the given sequence of instances.

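Lemma 6. Let $\langle x_1, x_2, \ldots, x_n\rangle$ be a sequence of $n$ problem instances, and let $p_i = p_{x_i}$.
Then the sequence $\langle p_1, p_2, \ldots, p_n\rangle$ satisfies Condition 2 from §2.1.2. That is, for any
sequence $S_1, S_2, \ldots, S_n$ of schedules and any schedule $S$,

$$\frac{\sum_{i=1}^{n} p_i(S_i \oplus S) - p_i(S_i)}{\ell(S)} \le \max_{(v,\tau) \in V \times \mathbb{R}_{>0}} \left\{ \frac{\sum_{i=1}^{n} p_i(S_i \oplus \langle (v, \tau)\rangle) - p_i(S_i)}{\tau} \right\}. \tag{3.2}$$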


Proof. Any schedule $S$ can be rewritten as $S = \langle a_1, a_2, \ldots, a_L\rangle$, where each action $a_l$ runs
a different heuristic, without changing the value of $p(S)$ (i.e., multiple runs of a heuristic
executed in the suspend-and-resume model can be consolidated into a single run). Fix
some instance $x_i$, and let $p'_i(S) = p_i(S_i \oplus S) - p_i(S_i)$. Thus, $p'_i(S)$ is the probability that
at least one action in $S$ solves $x_i$ after all actions in $S_i$ have failed to solve $x_i$. Using the
union bound, we have

$$p'_i(S) \le \sum_{l=1}^{L} p'_i(\langle a_l\rangle).$$

Let $a_l = (h_l, \tau_l)$, and let $r$ be the maximum in the right hand side of (3.2). Summing the
inequality over all $x \in X$ yields

$$\sum_{i=1}^{n} \left[ p_i(S_i \oplus S) - p_i(S_i) \right] \le \sum_{l=1}^{L} \sum_{i=1}^{n} \left[ p_i(S_i \oplus \langle a_l\rangle) - p_i(S_i) \right] \le \sum_{l=1}^{L} r \cdot \tau_l = r \cdot \ell(S)$$

which proves the lemma. □


In the offline setting, we are given a set of instances $X = \{x_1, x_2, \ldots, x_n\}$, and wish
to select a schedule $S$ that minimizes $\frac{1}{n} \sum_{i=1}^{n} \mathbb{E}[T(S, x_i)]$. Using equation (3.1), this is
the same as minimizing $\frac{1}{n} \sum_{i=1}^{n} c(p_{x_i}, S) = c(p, S)$, where $p(S) = \frac{1}{n} \sum_{i=1}^{n} p_{x_i}(S)$. The
following corollary of Lemma 6 shows that the function $p$ satisfies Condition 1 (defined
in §2.1.2). As we discuss further in §3.4.2, this implies that the greedy approximation
algorithm from Chapter 2 can be used to obtain a 4-approximation to the optimal schedule.

Corollary 1. Let $X = \{x_1, x_2, \ldots, x_n\}$ be a set of problem instances, and let
$p(S) = \frac{1}{n} \sum_{i=1}^{n} p_{x_i}(S)$. Then $p$ satisfies Condition 1 from §2.1.2. That is, for any $S_1, S \in \mathcal{S}$, we
have

$$\frac{p(S_1 \oplus S) - p(S_1)}{\ell(S)} \le \max_{(v,\tau) \in \mathcal{V} \times \mathbb{R}_{>0}} \left\{ \frac{p(S_1 \oplus \langle (v,\tau) \rangle) - p(S_1)}{\tau} \right\} .$$


3.2 Related Work


In  this  section  we  discuss  two  areas  of  related  work:  algorithm  portfolios  and  restart 
schedules. 


3.2.1  Algorithm  portfolios 

The  work  presented  in  this  chapter  is  closely  related  to,  and  shares  the  same  goals  as, 
previous  work  on  algorithm  portfolios  [33,  38].  An  algorithm  portfolio  is  a  schedule  for 
combining  runs  of  various  heuristics.  The  schedules  considered  in  the  original  papers  on 
algorithm  portfolios  simply  run  each  heuristic  in  parallel  at  equal  strength  and  assign  each 
heuristic  a  fixed  restart  threshold.  Gomes  et  al.  [33]  addressed  the  problem  of  constructing 
an  optimal  algorithm  portfolio  offline  given  knowledge  of  the  run  length  distribution  of 
each algorithm, under the assumption that each algorithm has the same run length
distribution on all problem instances. Earlier work [43] considered the problem of devising a
schedule  for  combining  multiple  heuristics  that  achieves  an  optimal  competitive  ratio  on  a 
single  problem  instance. 

A recent paper by Sayag et al. [73] considered the problems of selecting task-switching
schedules and resource-sharing schedules for multiple heuristics, both in the offline and
learning-theoretic settings. A resource-sharing schedule $S : \mathcal{H} \to [0, 1]$ specifies that all
heuristics in $\mathcal{H}$ are to be run in parallel, with each $h \in \mathcal{H}$ receiving a proportion $S(h)$
of the CPU time. The primary contribution of their paper was an offline algorithm that
computes an optimal resource-sharing schedule in $O(n^{k-1})$ time. They also discuss an
$O(n^{k+1})$ algorithm for computing optimal task-switching schedules offline. As proved by
Sayag et al. (Lemma 1 of [73]), an optimal task-switching schedule always performs as
well as or better than an optimal resource-sharing schedule.

Independently, Petrik [67] and Petrik and Zilberstein [68] gave exact and approximation
algorithms for computing optimal task-switching schedules and optimal resource-sharing
schedules. Their algorithms are based on dynamic programming, and the running
time is exponential in $k$.

The term algorithm portfolio has also been used to describe approaches that use features
of instances to attempt to predict which algorithm will run the fastest on a given
instance, and then simply run that algorithm exclusively [59, 90]. Note that in this approach
there is no notion of a schedule per se. As already mentioned, we show how
instance-specific features can be incorporated into our framework later in this chapter.

The works just described consider the problem of learning an algorithm portfolio from
training data. Recently, Gagliolo and Schmidhuber [30] presented an algorithm that, like
our  online  algorithms,  can  be  used  to  select  algorithm  portfolios  on-the-fly  while  solving 
a  sequence  of  problem  instances.  Their  algorithm  produces  resource-sharing  schedules, 
and  uses  one  of  a  number  of  rules  to  select  resource-sharing  schedules  based  on  statistical 
models  of  the  behavior  of  the  heuristics. 


3.2.2  Restart  schedules 

There  has  been  a  considerable  amount  of  work  on  devising  restart  schedules  for  Las  Vegas 
algorithms. Letting $A$ denote an arbitrary Las Vegas algorithm, such a schedule can be
represented as a sequence $\langle t_1, t_2, \ldots \rangle$ of positive real numbers, whose meaning is “run $A$
for $t_1$ time units; if this does not yield a solution then restart and run for $t_2$ time units, ...”.

In the early 1990s, at least two papers studied the problem of selecting a restart schedule
to use in solving a single problem instance, given no prior knowledge of the algorithm’s
run  length  distribution  on  that  instance.  The  main  results  are  summarized  in  the  following 
theorem  proved  by  Luby  et  al.  [61]. 

Theorem 16 (Luby et al., 1993).

1. For any instance $x$, the schedule $S$ that minimizes $\mathbb{E}[T(S, x)]$ is a uniform restart
schedule of the form $\langle t^*, t^*, \ldots, t^* \rangle$.

2. Let $\ell = \min_{S \in \mathcal{S}_{rs}} \mathbb{E}[T(S, x)]$. The universal restart schedule²

$$S_{univ} = \langle 1, 1, 2, 1, 1, 2, 4, 1, 1, 2, 1, 1, 2, 4, 8, \ldots \rangle$$

has $\mathbb{E}[T(S_{univ}, x)] = O(\ell \log \ell)$.

3. For any schedule $S$, it is possible to define a distribution of $T(A, x)$ such that
$\mathbb{E}[T(S, x)] \ge \frac{1}{8} \ell \log_2 \ell$ (i.e., the worst-case performance of $S_{univ}$ is optimal to
within constant factors).

Luby et al. [61] also showed that other classes of restart schedules, for example
suspend-and-resume and probabilistic schedules, are no more powerful than ordinary restart
schedules (assuming the restart schedule’s performance is measured on a single problem
instance). Alt et al. [3] gave related results, with a focus on minimizing tail probabilities
rather than expected running time.
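
The universal schedule of Theorem 16 is easy to generate mechanically. The following is a minimal sketch (not taken from the cited papers) that applies the verbal rule directly: all run lengths are powers of two, and a run of twice the length is scheduled as soon as two runs of a given length have been completed. The function name `universal_schedule` and the bookkeeping dictionary are illustrative assumptions.

```python
def universal_schedule(num_runs):
    """Return the first num_runs run lengths of the universal restart schedule."""
    unpaired = {}          # run length -> completed runs not yet "used up" by a doubling
    lengths = []
    next_length = 1
    for _ in range(num_runs):
        lengths.append(next_length)
        unpaired[next_length] = unpaired.get(next_length, 0) + 1
        if unpaired[next_length] == 2:
            # Two runs of this length are done: immediately schedule a run twice as long.
            unpaired[next_length] = 0
            next_length *= 2
        else:
            # Otherwise start again from the shortest run length.
            next_length = 1
    return lengths


print(universal_schedule(15))
```

The first fifteen terms produced by this sketch agree with the prefix of the universal schedule listed in Theorem 16.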

² The universal schedule can be described as follows. All run lengths are powers of two, and as soon as
two runs of the same length have been completed, a run of twice that length is immediately performed.

In the late 1990s, there was a renewed interest in restart schedules due to a paper by
Gomes et al. [35], which demonstrated that (then) state-of-the-art solvers for Boolean
satisfiability and constraint satisfaction could be dramatically improved by randomizing the
solver’s decision-making heuristics and running the randomized solver with an appropriate
restart schedule. In one experiment, their paper took a deterministic SAT solver called
satz and created a version called satz-rand with a randomized heuristic for selecting
which variable to branch on at each node in the search tree. They found that, on certain
problem instances, satz-rand exhibited a heavy-tailed run length distribution. By
periodically restarting it they obtained order-of-magnitude improvements in running time over
both satz-rand (without restarts) and satz, the original deterministic solver. Their
paper also demonstrated the benefit of randomization and restart for a then state-of-the-art
constraint solver.

One limitation of Theorem 16 is that it is “all or nothing”: it either assumes complete
knowledge of the run length distribution (in which case a uniform restart schedule is
optimal) or no knowledge at all (in which case the universal schedule is optimal to within
constant factors). Several papers have considered the case in which partial but not
complete knowledge of the run length distribution is available. Ruan et al. [71] consider the
case in which each run length distribution is one of $m$ known distributions, and give a
dynamic programming algorithm for computing an optimal restart schedule. The running
time of their algorithm is exponential in $m$, and thus it is practical only when $m$ is small
(in the paper the algorithm is described for $m = 2$). Kautz et al. [45] considered the case
in which, after running for some fixed amount of time, one observes a feature that gives
the distribution of that run’s length.

A  paper  by  Gagliolo  &  Schmidhuber  [31]  considered  the  problem  of  selecting  restart 
schedules  online  in  order  to  solve  a  sequence  of  problem  instances  as  quickly  as  possible. 
Their  paper  treats  the  schedule  selection  problem  as  a  2-armed  bandit  problem,  where  one 
of  the  arms  runs  Luby’s  universal  schedule  and  the  other  arm  runs  a  schedule  designed 
to  exploit  the  empirical  run  length  distribution  of  the  instances  encountered  so  far.  Their 
strategy  was  designed  to  work  well  in  the  case  where  each  instance  has  a  similar  run  length 
distribution. 

3.2.3  Contributions  of  this  chapter 

The  results  in  this  chapter  advance  the  state  of  the  art  in  algorithm  portfolio  design  in  four 
important  ways: 

1. We consider a powerful class of schedules that generalizes both task-switching
schedules and restart schedules.


2. We provide polynomial-time approximation algorithms for computing schedules of
this form, and state hardness-of-approximation results showing that no polynomial-time
algorithm can provide better worst-case guarantees assuming P ≠ NP. Note that no
polynomial-time approximation algorithms were previously known for computing
task-switching schedules or for computing restart schedules (in the general case of
multiple problem instances with heterogeneous run length distributions).

3. We provide online algorithms with strong theoretical guarantees, which can be used
to learn an appropriate schedule on-the-fly while solving a sequence of problems.
The worst-case performance of our online algorithms is the same as that of the
corresponding offline approximation algorithm, asymptotically as the number of problem
instances goes to infinity.

4. We show how features of problem instances can be exploited by our online
algorithms in a natural way, thus unifying the benefits of two previous approaches
[38, 59] to algorithm portfolio design.


3.3  State  Space  Representation  of  Schedules 

In  this  section  we  introduce  a  state-space  representation  of  schedules  that  will  be  used 
extensively  in  later  sections.  We  first  discuss  profiles  and  states,  which  provide  a  canonical 
way  of  representing  the  work  done  by  a  particular  schedule  at  a  particular  time.  We  then 
discuss  how  the  problem  of  computing  an  optimal  schedule  can  be  solved  as  a  shortest 
path  problem  in  a  graph  whose  vertices  are  states.  We  use  this  shortest  path  formulation 
in  §3.4  to  obtain  offline  algorithms,  then  use  it  again  in  §3.5  to  obtain  sample  complexity 
bounds for learning a schedule from training data. Lastly, we discuss α-regularity, which
provides  a  way  to  drastically  reduce  the  size  of  the  state  space  at  the  cost  of  a  constant 
factor  performance  degradation,  leading  in  §3.4  to  offline  approximation  algorithms  and 
in  §3.5  to  additional  sample  complexity  bounds. 


3.3.1  Profiles  and  states 

A profile $P = \langle \tau_1, \tau_2, \ldots, \tau_L \rangle$ is a non-increasing sequence of positive real numbers. For
any heuristic $h$, we use $\mathcal{P}(S, h)$ to denote a profile that lists, in non-increasing order, the
total length of each run of $h$ performed by $S$. If $h$ is being executed in the suspend-and-resume
model, then $\mathcal{P}(S, h)$ always contains a single number. If $h$ is being executed in
the restart model, then the number of values in $\mathcal{P}(S, h)$ equals the number of actions in $S$
that refer to $h$. The size of a profile $P = \langle \tau_1, \tau_2, \ldots, \tau_L \rangle$ is defined to be $\sum_{l=1}^{L} \tau_l$.

As an example, let

$$S = \langle (h_1, 1), (h_2, 2), (h_1, 3), (h_2, 4) \rangle .$$

If $h_1$ is being executed in the suspend-and-resume model, then $\mathcal{P}(S, h_1) = \langle 4 \rangle$; otherwise
$\mathcal{P}(S, h_1) = \langle 3, 1 \rangle$. Similarly, if $h_2$ is being executed in the suspend-and-resume model,
then $\mathcal{P}(S, h_2) = \langle 6 \rangle$; otherwise $\mathcal{P}(S, h_2) = \langle 4, 2 \rangle$.
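
The example can be reproduced with a few lines of code. This is a minimal sketch under assumed representations: a schedule is a list of `(heuristic, duration)` pairs, a profile is a tuple, and the helper name `profile` is hypothetical.

```python
def profile(schedule, heuristic, suspend_and_resume):
    """Profile of `heuristic` under `schedule`, as a tuple in non-increasing order."""
    runs = [tau for (h, tau) in schedule if h == heuristic]
    if not runs:
        return ()
    if suspend_and_resume:
        # All actions extend one run, so the profile is the single total length.
        return (sum(runs),)
    # Restart model: every action is a fresh run of its own length.
    return tuple(sorted(runs, reverse=True))


S = [("h1", 1), ("h2", 2), ("h1", 3), ("h2", 4)]
print(profile(S, "h1", True), profile(S, "h1", False))
print(profile(S, "h2", True), profile(S, "h2", False))
```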

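A state $Y = \langle P_1, P_2, \ldots, P_k \rangle$ is a $k$-tuple of profiles, where $k = |\mathcal{H}|$. We define the
size of a state to be the sum of the sizes of the $k$ profiles it contains. For any schedule $S$,
we define a corresponding state

$$\mathcal{Y}(S) = \langle \mathcal{P}(S, h_1), \mathcal{P}(S, h_2), \ldots, \mathcal{P}(S, h_k) \rangle$$

where $\mathcal{H} = \{h_1, h_2, \ldots, h_k\}$. We use $Y_\emptyset = \langle \langle\rangle, \langle\rangle, \ldots, \langle\rangle \rangle$ to denote the empty state.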

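For any instance $x$, the value of $p_x(S)$ (the probability that performing the runs in $S$
yields a solution to $x$) can be written as follows. Let $\mathcal{Y}(S) = \langle P_1, P_2, \ldots, P_k \rangle$, where
$P_j = \langle \tau^j_1, \tau^j_2, \ldots, \tau^j_{L_j} \rangle$. Then

$$p_x(S) = 1 - \prod_{j=1}^{k} \prod_{l=1}^{L_j} \mathbb{P}\left[ T(h_j, x) > \tau^j_l \right] . \tag{3.3}$$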
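
Equation (3.3) translates directly into code. The sketch below assumes that a state is a tuple of profiles (one per heuristic) and that, for each heuristic, a hypothetical function `survival[j]` returns the probability that a run of that heuristic on the instance is still unfinished after a given amount of time; exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def solve_probability(state, survival):
    """Probability that the runs recorded in `state` solve the instance, per (3.3).

    state[j] is the profile of heuristic j; survival[j](tau) = P[T(h_j, x) > tau].
    """
    all_fail = Fraction(1)
    for j, prof in enumerate(state):
        for tau in prof:
            all_fail *= survival[j](tau)   # every run must fail independently
    return 1 - all_fail


# Illustrative run length distribution: 1 or 1000 time units, each with probability 1/2.
def example_survival(tau):
    if tau >= 1000:
        return Fraction(0)
    if tau >= 1:
        return Fraction(1, 2)
    return Fraction(1)


# Three independent restart-model runs of length 1 succeed with probability 1 - (1/2)^3.
print(solve_probability(((1, 1, 1),), [example_survival]))
```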

The  key  property  of  states  is  given  by  the  following  lemma,  which  follows  immediately 
from  the  definitions. 

Lemma 7. Let $S_1$ and $S_2$ be schedules such that $\mathcal{Y}(S_1) = \mathcal{Y}(S_2)$. Then

1. for any instance $x$, $p_x(S_1) = p_x(S_2)$, and

2. for any schedule $S$, $\mathcal{Y}(S_1 \oplus S) = \mathcal{Y}(S_2 \oplus S)$.

Proof. The first statement is immediate from (3.3). To prove the second statement, assume
for contradiction that $\mathcal{P}(S_1 \oplus S, h) \ne \mathcal{P}(S_2 \oplus S, h)$, for some heuristic $h$. First, suppose
$h$ is executed in the suspend-and-resume model. Then $\mathcal{P}(S_1, h) = \mathcal{P}(S_2, h) = \langle \tau \rangle$, and
$\mathcal{P}(S, h) = \langle \tau' \rangle$. It follows that $\mathcal{P}(S_1 \oplus S, h) = \mathcal{P}(S_2 \oplus S, h) = \langle \tau + \tau' \rangle$.

Now suppose $h$ is executed in the restart model. Let $\mathcal{P}(S_1, h) = \mathcal{P}(S_2, h) = \langle \tau_1, \tau_2, \ldots, \tau_L \rangle$,
and let $\mathcal{P}(S, h) = \langle \tau'_1, \tau'_2, \ldots, \tau'_M \rangle$. Then $\mathcal{P}(S_1 \oplus S, h)$ simply contains the
$L + M$ values in $\mathcal{P}(S_1, h)$ and $\mathcal{P}(S, h)$, listed in non-increasing order, and the same is
true of $\mathcal{P}(S_2 \oplus S, h)$. In either case the two profiles coincide, a contradiction. □

3.3.2  State  space  graphs 

A state space graph is a triple $G = (V, E, S_e)$, where $(V, E)$ is a directed acyclic graph
whose vertices are states, and $S_e : E \to \mathcal{S}$ is a function that labels each edge $e \in E$ with
a schedule $S_e(e)$. We require that a state-space graph satisfy the following conditions.

1. $V$ contains the empty state $Y_\emptyset$, as well as a distinguished vertex $v^*$ ($v^*$ is not a state),

2. for every state $Y \in V$, there is a path from $Y$ to $v^*$, and

3. if the edges $e_1, e_2, \ldots, e_L$ form a path from $Y_\emptyset$ to some state $Y$, then

$$\mathcal{Y}(S_e(e_1) \oplus S_e(e_2) \oplus \cdots \oplus S_e(e_L)) = Y .$$

We use $\mathcal{S}_G$ to denote the set of schedules that correspond to paths from $Y_\emptyset$ to $v^*$ in $G$. In
other words, $S \in \mathcal{S}_G$ if and only if there exists a path $e_1, e_2, \ldots, e_L$ from $Y_\emptyset$ to $v^*$ such
that $S = S_e(e_1) \oplus S_e(e_2) \oplus \cdots \oplus S_e(e_L)$.

We now describe how to assign weights to the edges of a state-space graph in such a
way that an optimal schedule within the set $\mathcal{S}_G$ can be found by computing a shortest path
from $Y_\emptyset$ to $v^*$. Let $e = (Y_1, Y_2)$ be a directed edge in the state space graph. Let $S_1$ be any
schedule such that $\mathcal{Y}(S_1) = Y_1$, and let $S = S_e(e)$. For any instance $x$, define the weight
$w(e, x)$ as

$$w(e, x) = \int_{t=\ell(S_1)}^{\ell(S_1 \oplus S)} 1 - p_x\left( (S_1 \oplus S)_{\langle t \rangle} \right) dt . \tag{3.4}$$

Note that by Lemma 7, the value of the right hand side does not depend on the choice of
$S_1$.

Let the edges $e_1, e_2, \ldots, e_L$ form a path from $Y_\emptyset$ to $v^*$ in the state space graph. Let
$S = S_e(e_1) \oplus S_e(e_2) \oplus \cdots \oplus S_e(e_L)$. By construction, the weight of the path is exactly

$$\sum_{i=1}^{L} w(e_i, x) = \int_{t=0}^{\ell(S)} 1 - p_x(S_{\langle t \rangle}) \, dt = \mathbb{E}\left[ \min\{\ell(S), T(S, x)\} \right] .$$

Thus we have the following lemma.

Lemma 8. Given a state space graph $G$ and a set of instances $X$, the schedule

$$S^* = \arg\min_{S \in \mathcal{S}_G} \sum_{x \in X} \mathbb{E}\left[ \min\{\ell(S), T(S, x)\} \right]$$

may be found by computing a shortest path from the empty state $Y_\emptyset$ to $v^*$ in $G$, where each
edge $e$ is assigned weight $\sum_{x \in X} w(e, x)$, where $w(e, x)$ is defined in equation (3.4).

Note that this lemma provides a way to optimize $\mathbb{E}[\min\{\ell(S), T(S, x)\}]$, whereas in
general we will be interested in optimizing $\mathbb{E}[T(S, x)]$. When making use of this lemma,
we will set up our state space graph $G$ so that $T(S, x) \le \ell(S)$ for all $x \in X$ and for all
$S \in \mathcal{S}_G$, either with certainty or with high probability.

3.3.3 α-Regularity

In this section we discuss α-regularity, which provides a way to reduce the number of
states  that  must  be  considered  when  using  the  shortest  path  formulation  described  in  the 
previous  section,  at  the  cost  of  a  constant  factor  performance  degradation. 


Special  case:  suspend-and-resume  only 

In the special case in which all heuristics are executed in the suspend-and-resume model,
the definition of an α-regular schedule is simple: a schedule $S$ is α-regular if, whenever
$S$ stops running a heuristic $h$ and starts running a different heuristic instead, the total time
invested so far in $h$ is a power of α (i.e., it equals $\alpha^i$ for some integer $i$). For example, the
schedule

$$S = \langle (h_1, 2), (h_2, 1), (h_1, 14), (h_2, 3) \rangle$$

is 2-regular. However, the schedule $S = \langle (h_1, 14), (h_2, 1), (h_1, 2), (h_2, 3) \rangle$ is not, because it switches away from $h_1$ after 14 time units and 14 is not a power of 2.
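
The suspend-and-resume definition can be checked mechanically. The following is a minimal sketch under simplifying assumptions that are not part of the text: α and all durations are positive integers, only exponents $i \ge 0$ are considered, and the condition is tested only at points where the schedule actually switches to a different heuristic. The function names are hypothetical.

```python
def is_power(value, alpha):
    """True if value == alpha**i for some integer i >= 0 (integer arguments only)."""
    if value < 1:
        return False
    while value % alpha == 0:
        value //= alpha
    return value == 1


def is_regular_suspend_resume(schedule, alpha):
    """Check alpha-regularity of a schedule of (heuristic, duration) actions."""
    invested = {}
    for idx, (h, tau) in enumerate(schedule):
        invested[h] = invested.get(h, 0) + tau
        switches = idx + 1 < len(schedule) and schedule[idx + 1][0] != h
        if switches and not is_power(invested[h], alpha):
            return False
    return True


print(is_regular_suspend_resume([("h1", 2), ("h2", 1), ("h1", 14), ("h2", 3)], 2))
print(is_regular_suspend_resume([("h1", 14), ("h2", 1), ("h1", 2), ("h2", 3)], 2))
```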

When all heuristics are executed in the suspend-and-resume model, it is not difficult to
show that for any schedule $S$, there is an α-regular schedule $S_\alpha$ whose expected running
time on any instance is at most α times that of $S$ (see Lemma 9).


General  case 

In the general case, the definition of α-regularity takes into account both the lengths of the
runs of each heuristic and the number of runs of each particular length that have
been performed. Additionally, the performance overhead increases from a factor of α to a
factor of $\alpha^2$.

For any $\alpha > 1$, we say that a profile $P = \langle \tau_1, \tau_2, \ldots, \tau_L \rangle$ is α-regular if, for all $l$
($1 \le l \le L$),

1. $\tau_l$ is a power of α (i.e., $\tau_l = \alpha^i$ for some integer $i$), and

2. the number of occurrences of the value $\tau_l$ in $P$ is the floor of a power of α (i.e.,
$|\{l' : \tau_{l'} = \tau_l\}| = \lfloor \alpha^i \rfloor$ for some integer $i$).

For  example,  the  profiles  (4, 1, 1)  and  (4,  2, 1, 1, 1, 1)  are  2-regular,  but  the  profile  (3, 1) 
is  not  (because  3  is  not  a  power  of  2),  and  the  profile  (2, 1, 1, 1)  is  not  (because  there  are 
three  runs  of  length  1). 
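
These four example profiles can be checked with a short function. The sketch below covers only the case of integer α ≥ 2 and positive integer run lengths, where the floor of a power of α that is at least 1 is itself a power of α; this restriction, and the function names, are assumptions made for illustration.

```python
from collections import Counter


def is_power(value, alpha):
    """True if value == alpha**i for some integer i >= 0 (integer arguments only)."""
    if value < 1:
        return False
    while value % alpha == 0:
        value //= alpha
    return value == 1


def is_regular_profile(profile, alpha):
    """Check both conditions of alpha-regularity for a profile given as a tuple."""
    counts = Counter(profile)
    return all(is_power(length, alpha) and is_power(count, alpha)
               for length, count in counts.items())


for p in [(4, 1, 1), (4, 2, 1, 1, 1, 1), (3, 1), (2, 1, 1, 1)]:
    print(p, is_regular_profile(p, 2))
```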

A state $Y = \langle P_1, P_2, \ldots, P_k \rangle$ is α-regular if each profile $P_j$ is α-regular. A schedule $S$
is α-regular if it can be written as $S = S_1 \oplus S_2 \oplus \cdots \oplus S_L$, where

1. for any $l$, all actions in $S_l$ are identical, and

2. for any $l$, $\mathcal{Y}(S_1 \oplus S_2 \oplus \cdots \oplus S_l)$ is an α-regular state.

The  following  example  illustrates  these  definitions. 

Example 4. Let $a_\tau = (h, \tau)$. Then the geometric restart schedule $S = \langle a_1, a_2, a_4, a_8 \rangle$ is
2-regular, as can be seen by writing it as $S = \langle a_1 \rangle \oplus \langle a_2 \rangle \oplus \langle a_4 \rangle \oplus \langle a_8 \rangle$. However, the
restart schedule

$$S = \langle a_1, a_2, a_1, a_2, a_1, a_2, a_1, a_2, \ldots \rangle$$

is not 2-regular. This is because, if we write the schedule as $S = \langle a_1 \rangle \oplus \langle a_2 \rangle \oplus \langle a_1 \rangle \oplus
\langle a_2 \rangle \oplus \cdots$, then the state

$$\mathcal{Y}(\langle a_1 \rangle \oplus \langle a_2 \rangle \oplus \langle a_1 \rangle \oplus \langle a_2 \rangle \oplus \langle a_1 \rangle \oplus \langle a_2 \rangle) = \langle \langle 2, 2, 2, 1, 1, 1 \rangle \rangle$$

is not 2-regular. However, by permuting the order of the runs in $S$, we can obtain a
schedule

$$S' = \langle a_1, a_2, a_1, a_2, a_1, a_1, a_2, a_2, a_1, a_1, a_1, a_1, a_2, a_2, a_2, a_2, \ldots \rangle$$

that is again 2-regular.

The key property of α-regular schedules is given by the following lemma, which shows
that an arbitrary schedule can be “rounded up” to an α-regular schedule while introducing
at most a factor $\alpha^2$ performance overhead (in the special case where all heuristics are
executed in the suspend-and-resume model, there is only a factor α overhead). The proof
is given in Appendix A.

Lemma 9. For any schedule $S$ and any $\alpha > 1$, there exists an α-regular schedule $S_\alpha$ such
that, for any instance $x$, $\mathbb{E}[T(S_\alpha, x)] \le \alpha^2 \cdot \mathbb{E}[T(S, x)]$. In the special case where all
heuristics are executed in the suspend-and-resume model, $\mathbb{E}[T(S_\alpha, x)] \le \alpha \cdot \mathbb{E}[T(S, x)]$.


3.4  Offline  Algorithms 

In the offline setting we are given as input a set of instances $X = \{x_1, x_2, \ldots, x_n\}$, and
are given the distribution of $T(h, x)$ for all $h \in \mathcal{H}$ and $x \in X$. Our goal is to compute the
schedule

$$S^* = \arg\min_{S \in \mathcal{S}} \sum_{i=1}^{n} \mathbb{E}[T(S, x_i)] .$$

This  offline  problem  is  of  interest  for  two  reasons.  First,  in  the  learning-theoretic  setting 
one  must  solve  the  offline  problem  in  order  to  compute  an  optimal  schedule  for  the  set  of 
training  instances.  Second,  the  algorithms  we  develop  for  solving  the  offline  problem  will 
serve  as  a  basis  for  our  online  algorithms. 


3.4.1  Computational  complexity 

If $|\mathcal{H}|$ is arbitrary, it is NP-hard to compute even an approximately optimal schedule. This
is true even in the special case in which each heuristic in $\mathcal{H}$ is deterministic (so that we
only need to consider task-switching schedules).

To see this, consider the special case in which all heuristics are deterministic and,
for each instance $x \in X$ and each heuristic $h \in \mathcal{H}$, $T(h, x) \in \{1, \infty\}$. In this case,
an optimal task-switching schedule can be represented simply as a permutation of the $k$
heuristics (where the permutation corresponds to a schedule that runs each heuristic for
one unit of time, in the order specified by the permutation). If we identify each heuristic $h$
with the set of instances $\{x \in X : T(h, x) = 1\}$, then our goal is to order these sets from
left to right so as to minimize the sum, over all $x \in X$, of the position of the leftmost
set that contains $x$. This is exactly the Min-Sum Set Cover problem introduced by
Feige, Lovász, and Tetali [26], who proved that for any $\epsilon > 0$, achieving a $4 - \epsilon$ ratio for
Min-Sum Set Cover is NP-hard. Thus we have the following theorem.

Theorem 17. For any $\epsilon > 0$, obtaining a $4 - \epsilon$ approximation to the optimal schedule
is NP-hard, even in the special case when each heuristic $h \in \mathcal{H}$ is deterministic and
$T(h, x) \in \{1, \infty\}$ for all $h \in \mathcal{H}$ and all $x \in X$.

In light of previous work on restart schedules, it is natural to ask what happens in the
special case when $\mathcal{H}$ contains a single randomized heuristic $h$. If, additionally, $X$ contains
a single problem instance (or if the distribution of $T(h, x)$ is the same for all $x \in X$)
then the problem of computing an optimal schedule is trivial, as shown in Theorem 16.
However, in the general case in which $T(h, x)$ may have a different distribution for each
$x \in X$, the problem appears to be more complex. Although we have not been able to
determine whether the offline optimization problem is NP-hard in this special case, we do
believe that designing an algorithm to solve it is non-trivial. One simple idea that does not
work is to compute an optimal restart schedule for the single distribution that results from
averaging the distribution of $T(h, x)$ for each $x \in X$. To see the flaw in this idea, consider
the following example with two instances, $x_1$ and $x_2$: $T(h, x_1)$ equals 1 with probability
$\frac{1}{2}$ and equals 1000 with probability $\frac{1}{2}$, while $T(h, x_2) = 1000$ with probability 1. The
optimal restart schedule for the averaged distribution is the uniform schedule $\langle 1, 1, 1, \ldots \rangle$;
however, this schedule never solves $x_2$.
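
The arithmetic behind this example can be checked numerically. The sketch below assumes the two probabilities for $x_1$ are each 1/2, and uses the standard identity that a uniform restart schedule with cutoff $t$ has expected running time $\mathbb{E}[\min(T, t)] / \mathbb{P}[T \le t]$; the function name and the dictionary representation of a distribution are illustrative.

```python
from fractions import Fraction


def uniform_restart_cost(dist, cutoff):
    """Expected time to solve when every run is cut off after `cutoff` time units.

    dist maps run length -> probability. Returns None if no run can succeed.
    """
    success = sum(p for length, p in dist.items() if length <= cutoff)
    if success == 0:
        return None
    time_per_run = sum(p * min(length, cutoff) for length, p in dist.items())
    return time_per_run / success


x1 = {1: Fraction(1, 2), 1000: Fraction(1, 2)}
x2 = {1000: Fraction(1)}
averaged = {t: (x1.get(t, 0) + x2.get(t, 0)) / 2 for t in set(x1) | set(x2)}

print(uniform_restart_cost(averaged, 1))      # cutoff 1 on the averaged distribution
print(uniform_restart_cost(averaged, 1000))   # cutoff 1000 on the averaged distribution
print(uniform_restart_cost(x2, 1))            # cutoff 1 on x2 alone: never succeeds
```

On the averaged distribution the cutoff 1 looks far better than the cutoff 1000, yet on $x_2$ alone a cutoff of 1 has zero success probability, which is the flaw described in the text.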


3.4.2  Greedy  approximation  algorithm 


We first consider the greedy approximation algorithm from Chapter 2. As mentioned in
§3.1.4, equation (3.1) shows that minimizing $\sum_{i=1}^{n} \mathbb{E}[T(S, x_i)]$ is equivalent to
minimizing $c(p, S)$, where $p(S) = \frac{1}{n} \sum_{i=1}^{n} p_{x_i}(S)$.

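For our purposes, the greedy schedule $G = \langle g_1, g_2, \ldots \rangle$ can be defined inductively as
follows: $G_1 = \langle\rangle$, $G_j = \langle g_1, g_2, \ldots, g_{j-1} \rangle$ for $j > 1$, and

$$g_j = \arg\max_{(h,\tau) \in \mathcal{H} \times \mathbb{R}_{>0}} \left\{ \frac{p(G_j \oplus \langle (h, \tau) \rangle) - p(G_j)}{\tau} \right\} . \tag{3.5}$$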

We may stop adding actions to $G$ once we reach a $j$ such that $p(G_j) = 1$ (or once we
reach a $j$ such that $p(G_j) \ge 1 - \delta$ for some small $\delta > 0$). The time required to compute
each action $g_j$ is not prohibitive in general. If all heuristics are deterministic and there are
only $n$ instances, then for each heuristic $h$ we need to consider at most $n$ action durations
on each iteration of the greedy algorithm (because there are at most $n$ values of $\tau$ where
$p(G_j \oplus \langle (h, \tau) \rangle)$ changes). Alternatively, it is not hard to show that requiring $\tau$ to be a
power of some $\alpha > 1$ can increase the cost by at most a factor of α. If we restrict our
attention to action durations that are powers of α between $\tau_{min}$ and $\tau_{max}$, then at most
$k \left( 1 + \log_\alpha \frac{\tau_{max}}{\tau_{min}} \right)$ evaluations of $p$ are necessary.
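
The greedy rule (3.5) can be sketched for the deterministic case described above. The code below is an illustration under stated assumptions, not the thesis's implementation: heuristics are deterministic with positive integer running times, every action is treated as a fresh run (restart model, so a run of length $\tau$ solves $x$ exactly when $T(h, x) \le \tau$), and candidate durations are restricted to the running times of still-unsolved instances. The names `greedy_schedule` and `run_times` are hypothetical.

```python
from fractions import Fraction
from math import inf


def greedy_schedule(run_times):
    """run_times[h][x] is the running time of heuristic h on instance x (inf if never)."""
    heuristics = list(run_times)
    instances = sorted({x for h in heuristics for x in run_times[h]})
    unsolved = set(instances)
    schedule = []
    while unsolved:
        best = None  # (gain per unit time, heuristic, duration)
        for h in heuristics:
            candidates = sorted({run_times[h][x] for x in unsolved
                                 if run_times[h][x] != inf})
            for tau in candidates:
                newly_solved = sum(1 for x in unsolved if run_times[h][x] <= tau)
                ratio = Fraction(newly_solved, len(instances)) / tau
                if best is None or ratio > best[0]:
                    best = (ratio, h, tau)
        if best is None:
            break  # the remaining instances cannot be solved by any heuristic
        _, h, tau = best
        schedule.append((h, tau))
        unsolved -= {x for x in unsolved if run_times[h][x] <= tau}
    return schedule


example = {
    "h1": {"x1": 1, "x2": 10, "x3": inf},
    "h2": {"x1": 5, "x2": 2, "x3": 3},
}
print(greedy_schedule(example))
```

In this example the first action runs `h1` for one time unit (solving one of three instances at a rate of 1/3 per unit time), after which running `h2` for three units solves the two remaining instances.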

We showed in §3.1.4 that $p$ satisfies conditions sufficient for the greedy algorithm’s
approximation guarantees to apply. Thus, we obtain the following as a corollary of Theorem
4.

Corollary 2.

$$\sum_{x \in X} \mathbb{E}[T(G, x)] \le 4 \min_{S \in \mathcal{S}} \sum_{x \in X} \mathbb{E}[T(S, x)] .$$

Remark 1. Corollary 2 and the definition of $G$ imply that, from the point of view of worst-case
approximation guarantees, the suspend-and-resume model provides no advantage
over the restart model. To see this, imagine that $\mathcal{H}$ contains two copies of each underlying
heuristic: one executed in the suspend-and-resume model, and one executed in the restart
model. If ties are broken appropriately, the action $g_1$ could use a heuristic executed in the
restart model (because the choice of model does not affect $p(\langle a \rangle)$ for any single action $a$).
Inductively, the entire schedule $G$ could only use heuristics executed in the restart model,
and still provide a 4-approximation. As shown in §3.4.1, achieving a $4 - \epsilon$ approximation
(for any $\epsilon > 0$) is NP-hard, even when all heuristics are deterministic, and regardless of
what model the heuristics are executed in. Thus, in terms of the worst-case approximation
ratio, there is no penalty associated with keeping only a single run in memory at a time.

Recall from Chapter 2 that the greedy schedule also (approximately) maximizes $p(S_{\langle T \rangle})$,
for certain values of $T$. In particular, we obtain the following as a corollary of Theorem
3. The corollary shows that, in addition to approximately minimizing average expected
running time, the greedy schedule approximately maximizes the expected fraction of
instances solved in time $\le T$, for certain values of $T$.

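Corollary 3. Fix an integer $L$, and let $T = \sum_{j=1}^{L} \tau_j$, where $g_j = (h_j, \tau_j)$. Then

$$p(G_{\langle T \rangle}) > \left( 1 - \frac{1}{e} \right) \max_{S \in \mathcal{S}} \left\{ p(S_{\langle T \rangle}) \right\} .$$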

Also recall from Chapter 2 that we defined an improved greedy schedule $G'$, which
has the same approximation guarantees as $G$ in terms of minimizing cost. In the special
case where $\mathcal{H}$ contains a single (possibly randomized) heuristic and $X$ contains a single
instance, the greedy schedule $G'$ is optimal, as the following theorem shows. The proof is
given in Appendix A.

Theorem 18. If $\mathcal{H}$ contains a single (randomized) heuristic and $X = \{x\}$ contains a
single instance, then

$$\mathbb{E}[T(G', x)] = \min_{S \in \mathcal{S}} \mathbb{E}[T(S, x)] .$$


3.4.3  Shortest  path  algorithms 

Although the problem of computing an optimal schedule is NP-hard when $k = |\mathcal{H}|$ is
part  of  the  input,  we  might  hope  to  find  an  algorithm  that  runs  in  polynomial  time  for 
any  fixed  k.  As  shown  in  §3.3.2,  the  optimal  schedule  within  a  restricted  set  of  schedules 
can  be  found  by  computing  a  shortest  path  in  an  appropriate  state-space  graph.  In  this 
section  we  use  this  idea  to  obtain  approximation  algorithms  for  computing  task-switching 
schedules  and  restart  schedules.  For  the  case  of  task-switching  schedules,  similar  (though 
not  identical)  dynamic  programming  algorithms  were  given  independently  by  Petrik  [67] 
and  by  Sayag  et  al.  [73]. 


Task-switching  schedules 

We first describe how to compute (approximately) optimal task-switching schedules.
Recall that a task-switching schedule is a schedule for a set of deterministic heuristics $\mathcal{H}$,
where each heuristic is executed in the suspend-and-resume model.

Let $B$ be an artificial bound on the amount of time we are allowed to run any heuristic.
We require that, for each instance $x \in X$, there is always some $h \in \mathcal{H}$ such that $T(h, x) \le B$.
Also, we assume without loss of generality that $T(h, x) \ge 1$ for all $x \in X$ and all
$h \in \mathcal{H}$.

Fix  some  desired  approximation  ratio  a  >  1.  Define  a  state-space  graph  GfB  = 
(V,  E,  Se)  inductively  as  follows.  The  vertex  set  contains  the  empty  state  Y®.  Let  Y  = 
((ti),  (t2),  . . . ,  (t/c))  G  V  be  a  state  in  the  vertex  set.  For  each  j  such  that  Tj  <  B . 
the  vertex  contains  the  state  Y'  =  ((ti),  (t2),  . . . ,  (Tj_i),  (rj),  (Tj+ i),  . . . ,  (rfc)),  where 
Tj  =  max  (1,  OTj}.  Additionally,  the  edge  set  contains  the  edge  e  =  (Y,  Y'),  where  Se(e) 
is  the  schedule  containing  the  single  action  ( hj ,  rj  —  Tj).  Finally,  if  t3  >  B  for  all  j ,  then 
there  is  an  edge  from  Y  to  v*,  labeled  with  the  empty  schedule. 

By construction, any $\alpha$-regular task-switching schedule $S$ that runs each heuristic for
time at most $B$ corresponds to a path from $Y_\emptyset$ to $v^*$ through this graph, and the weight of
the path is $\sum_{x \in X} T(S, x)$ (where the weights are assigned according to equation (3.4)).
Thus, by Lemmas 8 and 9, an $\alpha$-approximation to the optimal task-switching schedule can
be obtained by computing a shortest path from $Y_\emptyset$ to $v^*$ in the graph $G^{ts}_{\alpha,B}$.

To bound the time complexity, first note that there are at most $2 + \lceil \log_\alpha B \rceil$ choices³
for each value $\tau_j$ in a state $Y = (\langle \tau_1 \rangle, \langle \tau_2 \rangle, \ldots, \langle \tau_k \rangle)$ that appears in the vertex set, so
$|V| \le (2 + \lceil \log_\alpha B \rceil)^k$. Because each vertex has at most $k$ outgoing edges, $|E| \le k |V|$.
The time required to compute a shortest path is dominated by the time required to assign
weights to the edges. Each edge weight is a sum of $|X|$ per-instance weights, each of
which can be computed in time $O(1)$. Thus we have proved the following theorem.

Theorem 19. For any approximation ratio $\alpha > 1$ and any budget $B$, an $\alpha$-approximation
to the optimal task-switching schedule for a set of instances $X$ can be found by computing
a shortest path in the state-space graph $G^{ts}_{\alpha,B} = (V, E, S_e)$, where $|V| \le (2 + \lceil \log_\alpha B \rceil)^k$
and $|E| \le k |V|$. The overall time complexity is $O\left(nk (2 + \lceil \log_\alpha B \rceil)^k\right)$, where $n = |X|$.
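The construction above can be sketched in a few lines of Python. This is an illustrative sketch, not the thesis implementation: the function name `best_task_switching_schedule`, the input layout `times[x][j]` (standing for $T(h_j, x)$), and the per-instance edge weight used here (time charged to an instance while it is still unsolved) are assumptions made for the example, since equation (3.4) is not reproduced in this section. Because the state-space graph is acyclic, the shortest path is computed by memoized recursion.

```python
from functools import lru_cache


def best_task_switching_schedule(times, alpha, B):
    """times[x][j] is assumed to hold T(h_j, x) for deterministic heuristics.

    Returns (total_cost, actions): the weight of a shortest path from the
    empty state to the terminal vertex, and the actions (j, duration) on it.
    """
    k = len(times[0])

    def edge_weight(state, j, new_tau):
        # Time charged, summed over instances, while heuristic j advances
        # from state[j] to new_tau. Instances already solved cost nothing.
        total = 0
        for row in times:
            if any(row[i] <= state[i] for i in range(k)):
                continue  # solved before this action starts
            total += min(row[j], new_tau) - state[j]
        return total

    @lru_cache(maxsize=None)
    def solve(state):
        # Terminal condition: every heuristic has been run for time >= B.
        if all(tau >= B for tau in state):
            return 0, ()
        best = None
        for j in range(k):
            if state[j] >= B:
                continue
            new_tau = max(1, alpha * state[j])  # next alpha-regular value
            nxt = state[:j] + (new_tau,) + state[j + 1:]
            rest_cost, rest_actions = solve(nxt)
            cost = edge_weight(state, j, new_tau) + rest_cost
            if best is None or cost < best[0]:
                best = (cost, ((j, new_tau - state[j]),) + rest_actions)
        return best

    cost, actions = solve((0,) * k)
    return cost, list(actions)
```

For example, with two instances and two heuristics, `best_task_switching_schedule([[1, 4], [100, 2]], alpha=2, B=4)` runs the first heuristic for one time unit and then switches to the second. The number of memoized states is bounded by the vertex count in Theorem 19, so the sketch is only practical for small $k$.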

We now consider the problem of computing an optimal task-switching schedule. Given
a set of instances $X$, an optimal task-switching schedule can be found by computing a
shortest path in a state-space graph $G_X$ similar to the one just described. In the construction
just described, a state whose $j$th profile is $\langle \tau_j \rangle$ had an edge to a state that was identical
except that the $j$th profile is $\langle \tau'_j \rangle$, where $\tau'_j$ is the next largest power of $\alpha$ above $\tau_j$. In
$G_X$, $\tau'_j$ will instead be the next largest member of the set $\{0\} \cup \{T(h_j, x) : x \in X\}$. The set
$\mathcal{S}_{G_X}$ thus contains all schedules that will only stop running heuristic $h_j$ if the time invested
on $h_j$ so far equals $T(h_j, x)$ for some instance $x$. Using an interchange argument, it can
be shown that an optimal schedule must satisfy this condition. This fact, combined with
the arguments leading up to the statement of Theorem 19, proves the following theorem,
which was proved independently by Sayag et al. [73].

Theorem 20. An optimal task-switching schedule for a set of instances $X$ can be found by
computing a shortest path in the state-space graph $G_X = (V, E, S_e)$, where $|V| \le (n + 1)^k$
and $|E| \le k |V|$, where $n = |X|$. The overall time complexity is $O\left(nk (n + 1)^k\right)$.

Restart  schedules 

Lastly, we consider the case where $\mathcal{H}$ contains one or more (randomized) heuristics, each
of which is executed in the restart model. In the special case $|\mathcal{H}| = 1$, the results in this
section provide a way to compute approximately optimal restart schedules; however, we
allow $|\mathcal{H}|$ to be arbitrary.

As in the previous section, we assume $T(h, x) \ge 1$ (with probability 1) for all $h \in \mathcal{H}$
and all $x \in X$, and we require an artificial bound $B$ on the total time that any single
heuristic can be run (note that this time may be spread across multiple runs). When using
randomized heuristics, it may not be possible to solve a problem instance with certainty
in any finite time. We will confine our attention to schedules that, for any $x \in X$, solve $x$
with probability at least $1 - \delta$, for some $\delta \in (0, 1)$. Letting $\mathcal{S}_\delta$ denote the set of schedules
that have this property, our goal is to compute (an approximation to) the schedule

\[ S^* = \arg\min_{S \in \mathcal{S}_\delta} \sum_{x \in X} E\left[\min\left\{Bk, T(S, x)\right\}\right] . \]

(In other words, we charge a schedule time $Bk$ for instances it does not solve after running
each heuristic for time $B$.)

³ For example, if $\alpha = 2$ and $B = 4$, the possible choices are 0, 1, 2, and 4.

In this setting, we obtain a quasi-polynomial time approximation scheme by combining
Lemmas 8 and 9. Specifically, we obtain an $\alpha^2$-approximation to the optimal restart schedule
by computing a shortest path in a state-space graph $G^{rs}_{\alpha,B}$ whose vertex set contains all
the $\alpha$-regular states in which each heuristic is run for time at most $B$. The construction is
similar to the one described in the previous section, and is detailed in the proof of Theorem 21
in Appendix A. One difference from the construction in the previous section is
that, to ensure that only schedules in $\mathcal{S}_\delta$ can form a shortest path from $Y_\emptyset$ to $v^*$, edges of
the form $(Y, v^*)$ are assigned infinite weight if performing the runs in $Y$ does not solve
each instance with probability at least $1 - \delta$.

The key to bounding the time complexity is to show that $|V|$, which equals the number
of $\alpha$-regular states in which each heuristic has been run for time at most $B$, is at most
$B^{O(k \log_\alpha \log_\alpha B)}$, and that $|E| = O(\log_\alpha B \, |V|)$. We then show that, using precomputation,
edge weights can be assigned in time $O(n)$ per edge.

Theorem 21. Fix an approximation ratio $\alpha > 1$, a budget $B$, and an error tolerance
$\delta > 0$. Then an $\alpha^2$-approximation to schedule $S^*$ may be found by computing a shortest
path in a state-space graph $G^{rs}_{\alpha,B} = (V, E, S_e)$, where $|V| = O\left(B^{O(k \log_\alpha \log_\alpha B)}\right)$ and
$|E| = O(\log_\alpha B \, |V|)$. The overall running time is $O\left(n (\log_\alpha B) B^{O(k \log_\alpha \log_\alpha B)}\right)$, where
$n = |X|$.


3.5  Generalization  Bounds 

To apply the offline algorithms just discussed, we might collect a set of problem instances
to use as training data, compute an (approximately) optimal schedule for the training
instances, and then use this schedule to solve additional test instances. Under the assumption
that the training and test instances are drawn (independently) from a fixed probability
distribution, we would then like to know how much training data is required so that (with
high probability) our schedule performs nearly as well on test data as it did on the training
data. In our setting two distinct questions arise:

1.  How  many  training  instances  do  we  need? 

2.  How  many  runs  of  each  randomized  heuristic  h  must  we  perform  on  each  training 
instance  x  in  order  to  estimate  the  distribution  of  T  (h,  x )  to  sufficient  accuracy? 

We  deal  with  each  question  separately  in  the  following  subsections. 

In this section, we will confine our attention to schedules in the set $\mathcal{S}_G$, for some
state-space graph $G = (V, E, S_e)$ (although we will not necessarily find an optimal schedule for
the training instances by computing a shortest path in $G$).

3.5.1  How  many  instances? 

We first consider how many training instances are required to learn a schedule that is
probably approximately correct, under the assumption that the run length distribution of
each heuristic on each of the training instances is known exactly. In the next section, we
show that, by running each heuristic on each training instance for a surprisingly small
amount of time, we can obtain estimates of the run length distributions that allow us to obtain
the same guarantees as if we knew the run length distributions exactly.

Let $\{x_1, x_2, \ldots, x_m\}$ be a set of $m$ training instances drawn independently at random
from some distribution. For any edge $e \in E$ in the state-space graph $G = (V, E, S_e)$,
let $w(e, x)$ be defined as in equation (3.4), and let $\hat\mu(e) = \frac{1}{m} \sum_{i=1}^{m} w(e, x_i)$ be the sample
mean value of $w(e, x)$ over the instances in the training set. Let $\mu(e) = E[w(e, x)]$ be the
expected edge weight for a random test instance $x$.

Recall from §3.3.2 that any schedule $S \in \mathcal{S}_G$ corresponds to a path $e_1, e_2, \ldots, e_L$ from
$Y_\emptyset$ to $v^*$ in $G$, such that for any instance $x$,

\[ E\left[\min\{\ell(S), T(S, x)\}\right] = W(S, x) = \sum_{l=1}^{L} w(e_l, x) . \]

For any schedule $S$, let $\hat\mu(S) = \frac{1}{m} \sum_{i=1}^{m} W(S, x_i)$ be the sample mean weight of the path
corresponding to $S$ and let $\mu(S) = E[W(S, x)]$ be the expected value for a random test
instance $x$.

The following lemma bounds errors in the estimates of $\mu(S)$ in terms of errors in the
estimates of $\mu(e)$.
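
Lemma 10.

\[ \max_{S \in \mathcal{S}_G} \left\{ \frac{|\hat\mu(S) - \mu(S)|}{\ell(S)} \right\} \le \max_{e \in E} \left\{ \frac{|\hat\mu(e) - \mu(e)|}{\ell(S_e(e))} \right\} . \]

Proof. Fix an arbitrary schedule $S \in \mathcal{S}_G$, and let $e_1, e_2, \ldots, e_L$ be the corresponding path
from $Y_\emptyset$ to $v^*$. Then $\hat\mu(S) = \sum_{l=1}^{L} \hat\mu(e_l)$ by construction, while $\mu(S) = \sum_{l=1}^{L} \mu(e_l)$ by
linearity of expectation. Thus, letting $r$ denote the maximum value of $\frac{|\hat\mu(e) - \mu(e)|}{\ell(S_e(e))}$, we have

\[ |\hat\mu(S) - \mu(S)| \le \sum_{l=1}^{L} |\hat\mu(e_l) - \mu(e_l)| \le \sum_{l=1}^{L} r \cdot \ell(S_e(e_l)) = r \cdot \ell(S) \]

so $\frac{|\hat\mu(S) - \mu(S)|}{\ell(S)} \le r$, as claimed. □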


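We now bound the probability that there exists an edge $e$ such that $\frac{|\hat\mu(e) - \mu(e)|}{\ell(S_e(e))} \ge \epsilon$,
for some $\epsilon > 0$. For any edge $e \in E$, $\hat\mu(e)$ is the average of $m$ independent identically
distributed random variables, each of which has range $[0, \ell(S_e(e))]$ and expected value
$\mu(e)$. Thus by Hoeffding’s inequality,

\[ P\left[ \frac{|\hat\mu(e) - \mu(e)|}{\ell(S_e(e))} \ge \epsilon \right] \le 2 \exp\left(-2 m \epsilon^2\right) . \]

It follows that for any $\delta' > 0$, $m = O\left(\frac{1}{\epsilon^2} \ln \frac{1}{\delta'}\right)$ training instances are required to ensure
$P\left[ \frac{|\hat\mu(e) - \mu(e)|}{\ell(S_e(e))} \ge \epsilon \right] < \delta'$. Setting $\delta' = \frac{\delta}{|E|}$ and applying the union bound proves the
following theorem.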

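Theorem 22. Let $G = (V, E, S_e)$ be a state-space graph. If the number of training
instances $m$ satisfies the inequality

\[ m \ge m_0(\epsilon, \delta, G) = O\left(\frac{1}{\epsilon^2} \ln \frac{|E|}{\delta}\right) \tag{3.6} \]

then the inequality

\[ \max_{S \in \mathcal{S}_G} \left\{ \frac{|\hat\mu(S) - \mu(S)|}{\ell(S)} \right\} \le \epsilon \]

holds with probability at least $1 - \delta$.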


Fix some $\epsilon > 0$, and let $L_{\max} = \max_{S \in \mathcal{S}_G} \ell(S)$. Plugging $\epsilon / L_{\max}$ in place of $\epsilon$ into Theorem
22 shows that $O\left(\frac{L_{\max}^2}{\epsilon^2} \ln \frac{|E|}{\delta}\right)$ training instances suffice so that, with probability at least
$1 - \delta$, every schedule’s estimated expected cost is within $\epsilon$ of its true mean. In particular,
with probability at least $1 - \delta$, the schedule in $\mathcal{S}_G$ that performs optimally on the training
instances will have true expected cost at most $2\epsilon$ worse than optimal.
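
As a rough numerical illustration of the argument above, the following sketch computes a sufficient number of training instances by solving $|E| \cdot 2\exp(-2m\epsilon^2) \le \delta$ for $m$. The explicit constant comes from the two-sided Hoeffding bound combined with the union bound over edges; the text itself only states the bound up to constant factors, and the function names are hypothetical.

```python
import math


def training_instances_needed(eps, delta, num_edges):
    """Smallest m with num_edges * 2 * exp(-2 * m * eps**2) <= delta."""
    return math.ceil(math.log(2 * num_edges / delta) / (2 * eps ** 2))


def failure_bound(m, eps, num_edges):
    """Union bound over all edges of the two-sided Hoeffding tail."""
    return num_edges * 2 * math.exp(-2 * m * eps ** 2)
```

To obtain the additive guarantee discussed above, one would call `training_instances_needed(eps / L_max, delta, num_edges)`, where `L_max` is the maximum schedule length in the graph.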

Using  Theorem  22,  we  can  now  obtain  sample  complexity  bounds  for  any  state-space 
graph  of  interest.  In  particular,  putting  Theorem  19  together  with  Theorem  22  yields  the 
following  corollary. 

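Corollary 4. Let the function $m_0(\epsilon, \delta, G)$ be defined as in Theorem 22. Then

\[ m_0\left(\epsilon, \delta, G^{ts}_{\alpha,B}\right) = O\left(\frac{1}{\epsilon^2} \left(\ln \frac{1}{\delta} + k \log \log_\alpha B\right)\right) . \]

Similarly, putting Theorem 21 together with Theorem 22 yields the following corollary.

Corollary 5. Let the function $m_0(\epsilon, \delta, G)$ be defined as in Theorem 22. Then

\[ m_0\left(\epsilon, \delta, G^{rs}_{\alpha,B}\right) = O\left(\frac{1}{\epsilon^2} \left(\ln \frac{1}{\delta} + k \log B \log_\alpha \log_\alpha B\right)\right) . \]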

Corollaries 4 and 5 and Lemma 9 suggest the following procedure: compute an
(approximately) optimal schedule for the $m$ training instances (using any algorithm or
heuristic whatsoever), then round the schedule up to an $\alpha$-regular schedule, where $\alpha$ is chosen
so that Corollary 4 or Corollary 5 applies for some desired $\epsilon$ and $\delta$. The rounding step
prevents overfitting, and introduces only a constant factor performance overhead (in
particular, Lemma 9 shows that the rounding introduces at most a factor $\alpha$ overhead if all
heuristics are executed in the suspend-and-resume model, and at most a factor $\alpha^2$
overhead otherwise).

3.5.2  How  many  runs  per  instance? 

We now consider how many runs of each heuristic $h$ must be performed on each training
instance $x$ in order to estimate the distribution of the random variable $T(h, x)$ to sufficient
accuracy. Let $B$ denote the maximum time invested in any single heuristic by any schedule
$S \in \mathcal{S}_G$. If $h$ is deterministic, the distribution of $T(h, x)$ can be determined by performing
a single run (for our purposes, this run can be performed with a time limit of $B$). Thus,
our interest is in the case when $h$ is randomized. Surprisingly, it will turn out that we need
only to run each randomized heuristic for time $O(B \log B)$ in order to obtain sufficiently
accurate estimates.

Given a training instance $x$, we will describe how to obtain a function $\hat{w} : E \times X \to
\mathbb{R}_{\ge 0}$ such that, for any edge $e$ in the state-space graph, $E[\hat{w}(e, x) \mid x] = w(e, x)$. Note that
because $x$ is chosen at random, this implies

\[ E[\hat{w}(e, x)] = E\left[E[\hat{w}(e, x) \mid x]\right] = E[w(e, x)] = \mu(e) \]

where $\mu(e)$ was defined in the previous section. Thus, for the purposes of proving Theorem
22, the estimates $\hat{w}(e, x)$ are as good as the true values $w(e, x)$.

We obtain the desired function $\hat{w}$ in two steps. First, we obtain unbiased estimates of
the failure probability associated with each profile and heuristic. We then use the estimates
of failure probabilities to obtain the desired estimates of edge weights.

For the purposes of this section, we will assume $T(h, x) \ge 1$ (with probability 1) for
any heuristic $h$ and any training instance $x$.


Estimating  failure  probabilities 

Fix a training instance $x$ and a heuristic $h$. Define the failure probability of a profile
$P = \langle \tau_1, \tau_2, \ldots, \tau_L \rangle$ with respect to heuristic $h$ as the probability that performing runs of
lengths $\tau_1, \tau_2, \ldots, \tau_L$ of heuristic $h$ does not yield a solution to $x$:

\[ q_h(P) = \prod_{i=1}^{L} P\left[T(h, x) > \tau_i\right] . \]

Let $\mathcal{P}_B$ denote the set of profiles of size $\le B$, where each run is of length $\ge 1$ (note
that there can be at most $B$ runs in such a profile). In this section our goal is to spend
as little time as possible running $h$ on $x$ so as to obtain a function $\hat{q}_h : \mathcal{P}_B \to [0, 1]$ with
the following property: for any profile $P$, $E[\hat{q}_h(P)] = q_h(P)$ (i.e., $\hat{q}_h(P)$ is an unbiased
estimate of $q_h(P)$ for all $P \in \mathcal{P}_B$).

To obtain such a function, we perform $B$ independent runs of $h$ on $x$, where the $i$th run
is performed with a time limit of $\frac{B}{i}$. Note that the total time required for all runs of $h$ is at
most $\sum_{i=1}^{B} \frac{B}{i} = O(B \log B)$. Let $T_i$ be the time the $i$th run would have taken if it had been
performed with no time limit (whereas we only have knowledge of $\min\{T_i, \frac{B}{i}\}$), and call
the tuple $T = \langle T_1, T_2, \ldots, T_B \rangle$ a trace. For any profile $P = \langle \tau_1, \tau_2, \ldots, \tau_L \rangle$, we say that
$T$ encloses $P$ if $T_i > \tau_i$ for all $i$, $1 \le i \le L$ (see Figure 3.2). Our estimate is

\[ \hat{q}_h(P) = \begin{cases} 1 & \text{if } T \text{ encloses } P \\ 0 & \text{otherwise.} \end{cases} \]

Figure 3.2: An illustration of our estimation procedure. The profile $P = \langle \tau_1, \tau_2 \rangle$ (dots) is
enclosed by the trace $\langle T_1, T_2, T_3, T_4, T_5, T_6 \rangle$.

The estimate is unbiased (for profiles of size $\le B$) because $E[\hat{q}_h(P)] = P[\hat{q}_h(P) = 1] =
\prod_{i=1}^{L} P[T(h, x) > \tau_i] = q_h(P)$. Furthermore, the estimate can be computed given only
knowledge of $\min\{T_i, \frac{B}{i}\}$. This is true because if $P = \langle \tau_1, \tau_2, \ldots, \tau_L \rangle$ is a profile with
size $\le B$, then for each $i$ we have $\tau_i \le \frac{B}{i}$ (recall that the sequence $\langle \tau_1, \tau_2, \ldots, \tau_L \rangle$ is
non-increasing by definition).

Thus we have proved the following lemma.

Lemma 11. Given an instance $x$ and a heuristic $h$, after running $h$ for time $O(B \log B)$
(or time at most $B$, if $h$ is deterministic), we can obtain a function $\hat{q}_h$ such that for any
profile $P$, $E[\hat{q}_h(P)] = q_h(P)$.
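
The trace-based estimate can be sketched as follows. The callback `run(limit)` is a hypothetical stand-in for executing the heuristic once with a time limit; it is assumed to return the completion time when the run finishes within the limit and `None` when the run is cut off. Recording the cut-off explicitly handles the boundary case $\tau_i = B/i$, where the truncated time alone would not show whether $T_i > \tau_i$.

```python
from typing import Callable, List, Optional, Sequence


def collect_trace(run: Callable[[float], Optional[float]], B: int) -> List[Optional[float]]:
    """Perform B runs; the i-th run (1-based) gets time limit B / i."""
    return [run(B / i) for i in range(1, B + 1)]


def exceeds(observed: Optional[float], tau: float) -> bool:
    # A cut-off run took longer than its limit, and the limit is >= tau here.
    return observed is None or observed > tau


def zero_one_estimate(trace: Sequence[Optional[float]], profile: Sequence[float]) -> int:
    """Return 1 if the trace encloses the profile (T_i > tau_i for all i), else 0."""
    taus = sorted(profile, reverse=True)  # profiles are non-increasing
    assert len(taus) <= len(trace), "profile has more runs than the trace"
    return int(all(exceeds(t, tau) for t, tau in zip(trace, taus)))
```

With $B = 4$ the time limits are 4, 2, 4/3 and 1, so a trace such as `[3.0, None, 1.2, 0.5]` encloses the profile `[2.5, 1.0]` but not `[3.0, 1.0]`.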


Estimating  schedule  running  times 

Given the ability to obtain unbiased estimates of failure probabilities, we can readily obtain
unbiased estimates of the running time of any schedule. Fix a training instance $x$. For any
state $Y = (P_1, P_2, \ldots, P_k)$, let $Q(Y)$ be the probability that none of the runs in $Y$ yield a
solution to $x$:

\[ Q(Y) = \prod_{j=1}^{k} q_{h_j}(P_j) . \tag{3.7} \]

We can obtain an unbiased estimate of $Q$ as follows. For each $h \in \mathcal{H}$, let $\hat{q}_h$ be an unbiased
estimate of $q_h$, obtained (for example) using the procedure just described. Let $\hat{Q}$ be the
function obtained by plugging $\hat{q}_{h_j}$ in for $q_{h_j}$ in equation (3.7), for all $j$ ($1 \le j \le k$). Given
the instance $x$, the functions $\hat{q}_{h_1}, \hat{q}_{h_2}, \ldots, \hat{q}_{h_k}$ are independent. For any independent random
variables $A$ and $B$, $E[AB] = E[A] \cdot E[B]$. Thus, for any state $Y = (P_1, P_2, \ldots, P_k)$, we
have

\[ E[\hat{Q}(Y)] = \prod_{j=1}^{k} E\left[\hat{q}_{h_j}(P_j)\right] = \prod_{j=1}^{k} q_{h_j}(P_j) = Q(Y) . \tag{3.8} \]

Now fix an edge $e = (Y_1, Y_2)$. Let $S_1$ be a schedule such that $\mathcal{Y}(S_1) = Y_1$. Let
$S = S_e(e)$, let $t_1 = \ell(S_1)$, and let $t_2 = \ell(S_1 \oplus S)$. As defined in equation (3.4),

\[ w(e, x) = \int_{t=t_1}^{t_2} 1 - p_x\left((S_1 \oplus S)_{\langle t \rangle}\right) dt = \int_{t=t_1}^{t_2} Q\left(\mathcal{Y}\left((S_1 \oplus S)_{\langle t \rangle}\right)\right) dt . \]


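Let $\hat{w}(e, x)$ be the estimate of $w(e, x)$ obtained by using $\hat{Q}$ in place of $Q$ in this equation.
We have

\begin{align*}
E[\hat{w}(e, x)] &= E\left[ \int_{t=t_1}^{t_2} \hat{Q}\left(\mathcal{Y}\left((S_1 \oplus S)_{\langle t \rangle}\right)\right) dt \right] \\
&= \int_{t=t_1}^{t_2} E\left[ \hat{Q}\left(\mathcal{Y}\left((S_1 \oplus S)_{\langle t \rangle}\right)\right) \right] dt && \text{(linearity of expectation)} \\
&= \int_{t=t_1}^{t_2} Q\left(\mathcal{Y}\left((S_1 \oplus S)_{\langle t \rangle}\right)\right) dt && \text{(equation (3.8))} \\
&= w(e, x) .
\end{align*}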

Thus, given an arbitrary instance $x$, after running each deterministic heuristic for time
at most $B$ and running each randomized heuristic for time at most $O(B \log B)$, we can
obtain an unbiased estimate of $w(e, x)$ for any edge $e$ in the state-space graph.

Finally, if $S$ is a schedule corresponding to the path $e_1, e_2, \ldots, e_L$, then $\hat{W}(S, x) =
\sum_{l=1}^{L} \hat{w}(e_l, x)$ is an unbiased estimate of $W(S, x)$ by linearity of expectation. Thus, if we
redefine $\hat\mu(S) = \frac{1}{m} \sum_{i=1}^{m} \hat{W}(S, x_i)$ to be the average estimated weight assigned to $S$ over the
$m$ training instances, the following theorem follows by the same argument used to prove
Theorem 22.

Theorem 23. After running each deterministic heuristic for time at most $B$ per training
instance, and running each randomized heuristic for time at most $O(B \log B)$ per training
instance, we can obtain a function $\hat\mu(S)$ for which Theorem 22 holds. That is, if $G =
(V, E, S_e)$ is a state-space graph and the number of training instances $m$ satisfies the
inequality

\[ m \ge m_0(\epsilon, \delta, G) = O\left(\frac{1}{\epsilon^2} \ln \frac{|E|}{\delta}\right) \tag{3.9} \]

then the inequality

\[ \max_{S \in \mathcal{S}_G} \left\{ \frac{|\hat\mu(S) - \mu(S)|}{\ell(S)} \right\} \le \epsilon \]

holds with probability at least $1 - \delta$.

Theorem 23 is significant for two reasons. First, it implies that if we optimize over the
training instances in order to find a schedule with minimum estimated average expected
running time on the training instances, we will obtain the same guarantees as if we had
(somehow) found a schedule with minimum actual average expected running time on the
training instances. Second, and perhaps more importantly, being able to obtain unbiased
estimates of schedules’ running times will be critical for making our algorithms work in
the online setting.


Refining  the  estimation  procedure 

Although the procedure for estimating failure probabilities described in Lemma 11 is
sufficient to obtain all our theoretical results, the estimate is somewhat crude in that the estimate
of $q_h(P)$ (which is a probability) is always 0 or 1. The following lemma (proved in
Appendix A) gives a more refined unbiased estimate which we will use in our experimental
evaluation. As before, computing the estimate only requires knowledge of $\min\{T_i, \frac{B}{i}\}$
for each $i$.

Lemma 12. For any profile $P = \langle \tau_1, \tau_2, \ldots, \tau_k \rangle$ of size $\le B$, define $L_i(P) = \{i' : 1 \le
i' \le B, \frac{B}{i'} \ge \tau_i\}$. Then the quantity

\[ \hat{q}_h(P) = \prod_{i=1}^{k} \frac{\left|\{i' \in L_i(P) : T_{i'} > \tau_i\}\right| - i + 1}{|L_i(P)| - i + 1} \]

is an unbiased estimate of $q_h(P)$ (i.e., $E[\hat{q}_h(P)] = q_h(P)$).
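
A minimal sketch of the refined estimate follows, assuming the product form $\prod_i (c_i - i + 1) / (|L_i(P)| - i + 1)$, where $c_i$ counts the usable runs that lasted longer than $\tau_i$. The trace representation (completion time, or `None` for a run cut off at its limit $B/i$) and the function name are assumptions made for illustration.

```python
from typing import Optional, Sequence


def refined_estimate(trace: Sequence[Optional[float]], profile: Sequence[float]) -> float:
    """Refined estimate of the failure probability of `profile`.

    trace[i-1] is the completion time of run i if it finished within its
    limit B / i, and None if it was cut off (its true length exceeds B / i).
    """
    B = len(trace)
    taus = sorted(profile, reverse=True)  # profiles are non-increasing
    estimate = 1.0
    for i, tau in enumerate(taus, start=1):
        # L_i(P): runs whose time limit is long enough to compare with tau.
        usable = [t for ip, t in enumerate(trace, start=1) if B / ip >= tau]
        longer = sum(1 for t in usable if t is None or t > tau)
        estimate *= (longer - i + 1) / (len(usable) - i + 1)
        if estimate == 0.0:
            break  # later factors cannot change a zero product
    return estimate
```

For the trace `[3.0, None, 1.2, 0.5]` with $B = 4$, the profile `[1.0]` yields 3/4 rather than the 0-or-1 value of the cruder estimate. One way to read the formula is as the fraction of one-to-one assignments of usable runs to profile entries in which every assigned run outlasts its entry, which is why averaging over traces recovers $q_h(P)$.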


3.6  Online  Algorithms 

One weakness of the results of §3.5 is that they assume we can draw training (and test)
instances independently at random from a fixed probability distribution. In practice, the
distribution might change over time and successive instances might not be independent.
For example, a SAT solver might be used to solve a set of problem instances derived
from a particular application domain, and then be used to solve another set of instances
from a different domain. To further illustrate this point, consider again the example from
§3.1.1 of selecting a restart schedule for the SAT solver used by SATPLAN to solve one
or more planning problems. To solve a particular planning problem, SATPLAN generates
a sequence $\langle \sigma_1, \sigma_2, \ldots \rangle$ of Boolean formulae that are not at all independent. To optimize
SATPLAN’s performance, we would like to learn a restart schedule for the underlying SAT
solver on-the-fly, without making strong assumptions about the sequence of formulae that
are fed to it.

In this section we consider the problem of selecting schedules in a worst-case online
setting. In this setting we are fed, one at a time, a sequence $\langle x_1, x_2, \ldots, x_n \rangle$ of problem
instances to solve. Prior to receiving instance $x_i$, we must select a schedule $S_i \in \mathcal{S}$. As
in previous sections, we confine our attention to schedules that run each heuristic for time
at most $B$, for some budget $B > 0$. We then use $S_i$ to solve $x_i$ and incur cost $C_i$ equal
to the CPU time we spend running the heuristics in $\mathcal{H}$ on $x_i$, where the running time is
artificially truncated at time $Bk$, so

\[ E[C_i] = E\left[\min\{Bk, T(S_i, x_i)\}\right] . \]


As in Chapter 2, some form of truncation is necessary, for otherwise our algorithm might
be forced to spend an arbitrarily large amount of time on one particular instance and we
could not prove any meaningful bounds on its performance. After solving $x_i$ we know
the CPU time that was required, but nothing more (in particular, we do not know the
distribution of $T(h, x_i)$ for each $h \in \mathcal{H}$). Our $\alpha$-regret⁴ after having received $n$ instances
is equal to

\[ \sum_{i=1}^{n} E[C_i] - \alpha \cdot \min_{S \in \mathcal{S}_0} \left\{ \sum_{i=1}^{n} E\left[T(S, x_i)\right] \right\} \tag{3.10} \]

where $\mathcal{S}_0 \subseteq \mathcal{S}$ is some set of schedules, and where the expectation is over two sources of
randomness: the random bits supplied to the heuristics in $\mathcal{H}$, but also any random bits used
by our schedule-selection strategy. A strategy’s worst-case regret is the maximum value
of (3.10) over all instance sequences of length $n$.

⁴ $\alpha$-regret is unrelated to $\alpha$-regularity.
3.6.1  Handling  a  small  pool  of  schedules


Assume for the moment that we are given a set $\mathcal{S}_0$ of schedules to select from, where $|\mathcal{S}_0|$
is small enough that we would not mind using $O(|\mathcal{S}_0|)$ time or space for decision making.
In this case one option is to treat our online problem as an instance of the “nonstochastic
multiarmed bandit problem” and use the Exp3 algorithm of Auer et al. [5] to obtain regret
$O\left(Bk \sqrt{n \, |\mathcal{S}_0| \ln |\mathcal{S}_0|}\right)$.

To obtain regret bounds with a better dependence on $|\mathcal{S}_0|$, we use a version of the
“label-efficient forecaster” of Cesa-Bianchi et al. [17]. Applied to our online problem,
this strategy behaves as follows. Given an instance $x_i$, with probability $\gamma$ the strategy
explores by computing an unbiased estimate of $E[\min\{Bk, T(S, x_i)\}]$ for each $S \in \mathcal{S}_0$,
using the estimation procedure described in §3.5.2. Recall that this estimation procedure
requires us to spend time at most $B$ running each deterministic heuristic, and time at most
$\sum_{j=1}^{B} \frac{B}{j} = O(B \log B)$ running each randomized heuristic. We denote by $F$ the total
(maximum) running time over all $h \in \mathcal{H}$.

With probability $1 - \gamma$, the strategy exploits by selecting a schedule at random from
a distribution in which schedule $S$ is assigned probability proportional to $\exp(-\eta \hat{c}(S))$,
where $\hat{c}(S)$ is an unbiased estimate of $\sum_{l=1}^{i-1} E[\min\{Bk, T(S, x_l)\}]$ (obtained by summing
the estimates from each exploration round, and multiplying by $\frac{1}{\gamma}$) and $\eta$ is a learning
rate parameter. By Theorem 1 of Cesa-Bianchi et al. [17], the regret is at most

\[ Bk \left( \frac{\ln |\mathcal{S}_0|}{\eta} + \frac{\eta \, n}{2 \gamma} \right) + \gamma n F . \]
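
The explore/exploit loop described above might be organized as in the following sketch. The class and method names are hypothetical, and the code leaves the scale of the cost estimates to the caller (whether they are divided by $Bk$ before being recorded is an assumption this sketch does not settle); it only illustrates the exponential weighting and the $1/\gamma$ importance weighting of exploration rounds.

```python
import math
import random


class LabelEfficientForecaster:
    """Exponentially weighted forecaster that updates only on exploration rounds."""

    def __init__(self, num_schedules, eta, gamma, rng=None):
        self.eta = eta
        self.gamma = gamma
        self.cum_est = [0.0] * num_schedules  # running estimate c-hat(S) per schedule
        self.rng = rng or random.Random()

    def distribution(self):
        # Probability of S is proportional to exp(-eta * c-hat(S)).
        low = min(self.cum_est)  # shift by the minimum for numerical stability
        weights = [math.exp(-self.eta * (c - low)) for c in self.cum_est]
        total = sum(weights)
        return [w / total for w in weights]

    def next_round(self):
        """Return ("explore", None) or ("exploit", index of the chosen schedule)."""
        if self.rng.random() < self.gamma:
            return "explore", None
        probs = self.distribution()
        return "exploit", self.rng.choices(range(len(probs)), weights=probs)[0]

    def record_exploration(self, estimates):
        # Dividing by gamma compensates for exploring only a gamma fraction of rounds.
        for s, est in enumerate(estimates):
            self.cum_est[s] += est / self.gamma
```

On an "explore" outcome the caller would run the estimation procedure of §3.5.2 on the current instance and pass the per-schedule estimates to `record_exploration`; on an "exploit" outcome it would simply run the returned schedule.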


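Optimizing $\gamma$ and $\eta$ yields the following theorem.

Theorem 24. The label-efficient forecaster with learning rate $\eta = \left( \frac{\ln |\mathcal{S}_0|}{n} \sqrt{\frac{Bk}{2F}} \right)^{2/3}$ and
exploration probability $\gamma = \sqrt{\frac{\eta Bk}{2F}}$ has 1-regret at most $2 Bk \, n^{2/3} \left( 2 \ln |\mathcal{S}_0| \, \frac{F}{Bk} \right)^{1/3}$.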
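
The parameter choice can be checked numerically. The sketch below assumes the general bound $Bk\left(\frac{\ln|\mathcal{S}_0|}{\eta} + \frac{\eta n}{2\gamma}\right) + \gamma n F$ together with the settings $\eta = \left(\frac{\ln|\mathcal{S}_0|}{n}\sqrt{\frac{Bk}{2F}}\right)^{2/3}$ and $\gamma = \sqrt{\eta Bk/(2F)}$, and confirms that substituting them gives $2Bk\,n^{2/3}(2\ln|\mathcal{S}_0|\,F/(Bk))^{1/3}$. The function names are hypothetical.

```python
import math


def regret_bound(eta, gamma, n, num_schedules, Bk, F):
    """General bound: Bk * (ln|S_0| / eta + eta * n / (2 * gamma)) + gamma * n * F."""
    return Bk * (math.log(num_schedules) / eta + eta * n / (2 * gamma)) + gamma * n * F


def tuned_parameters(n, num_schedules, Bk, F):
    """Learning rate and exploration probability that balance the three terms."""
    eta = (math.log(num_schedules) / n * math.sqrt(Bk / (2 * F))) ** (2 / 3)
    gamma = math.sqrt(eta * Bk / (2 * F))
    return eta, gamma


def closed_form(n, num_schedules, Bk, F):
    """Value of the bound at the tuned parameters."""
    return 2 * Bk * n ** (2 / 3) * (2 * math.log(num_schedules) * F / Bk) ** (1 / 3)
```

With this choice the two terms $Bk\,\eta n/(2\gamma)$ and $\gamma n F$ are equal, and their sum equals the remaining term $Bk \ln|\mathcal{S}_0|/\eta$, which is where the leading factor of 2 comes from. The exploration probability is only meaningful when it is at most 1, which holds once $n$ is large relative to $\ln|\mathcal{S}_0|$.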


3.6.2  Online  shortest  path  algorithms 

In the previous section we described how the label-efficient forecaster of Cesa-Bianchi et
al. [17] can be used to converge to an optimal schedule within a set $\mathcal{S}_0$, but the time and
space required for decision-making were both $O(|\mathcal{S}_0|)$. In this section, we describe how
the label-efficient forecaster can be implemented more efficiently for certain sets $\mathcal{S}_0$ by
exploiting the shortest path formulation discussed in §3.3.2.

Let $G = (V, E, S_e)$ be a state-space graph, and let $\mathcal{S}_0 = \mathcal{S}_G$. Using the dynamic
programming approach described by György et al. [36], we can maintain the distribution
over schedules (i.e., paths) used by the label-efficient forecaster implicitly, by maintaining
a weight for each edge in our graph. The space required, as well as the time required for
each exploitation step, then become $O(|E|)$ rather than $O(|\mathcal{S}_0|)$, a potentially dramatic
improvement.

The total decision-making time required by this approach is $O(n |E|)$. By using a
“lazy” implementation of the exploitation steps, we can reduce the total decision-making
time to $O(m |E|)$, where $m$ is the number of exploration rounds (note that this is
comparable to the amount of time required to solve the offline shortest path problem on the
$m$ training instances). The idea of this approach is to only resample a schedule whenever
the distribution over schedules changes (i.e., sample once after each exploration step). On
any particular round, this variant of the algorithm has the same expected behavior as the
original version, and thus by linearity of expectation the overall worst-case regret bounds
are unchanged. This approach has been used in other online algorithms (e.g., see [42]).
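
One standard way to sample a path with probability proportional to $\exp(-\eta \cdot \text{cost})$ while storing only per-edge quantities is a backward dynamic program over the acyclic graph. The sketch below illustrates that idea under stated assumptions; it is not necessarily the exact procedure of György et al. [36]. The graph is assumed to be given as a dictionary mapping each non-sink vertex to a list of `(successor, estimated_cost)` pairs, with at most one edge per successor and every vertex able to reach the sink.

```python
import math
import random


def suffix_sums(edges, sink, eta):
    """Z[v] = sum over all paths from v to sink of exp(-eta * path cost)."""
    Z = {sink: 1.0}

    def visit(v):
        if v not in Z:
            Z[v] = sum(math.exp(-eta * c) * visit(u) for u, c in edges[v])
        return Z[v]

    for v in edges:
        visit(v)
    return Z


def edge_probabilities(edges, sink, eta, v):
    """Probability of leaving v along each outgoing edge."""
    Z = suffix_sums(edges, sink, eta)
    return {u: math.exp(-eta * c) * Z[u] / Z[v] for u, c in edges[v]}


def sample_path(edges, source, sink, eta, rng):
    """Sample a source-to-sink path with probability proportional to exp(-eta * cost)."""
    Z = suffix_sums(edges, sink, eta)
    path, v = [source], source
    while v != sink:
        successors = edges[v]
        weights = [math.exp(-eta * c) * Z[u] for u, c in successors]
        v = rng.choices([u for u, _ in successors], weights=weights)[0]
        path.append(v)
    return path
```

In the lazy variant described above, `suffix_sums` would be recomputed only after an exploration round changes the edge costs, and the most recently sampled path would be reused in between. For large cumulative costs the exponentials can underflow, so a practical implementation would likely work in log space.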


3.6.3  Online  greedy  algorithm 

In §3.1.4 we showed that the online problem considered in this chapter satisfies the
sufficient conditions required by the online greedy algorithm for Min-Sum Submodular
Cover from Chapter 2. In this section we flesh out the guarantees that the online greedy
algorithm provides for the online problem considered in this chapter.

We will confine our attention to schedules in which each action duration is an integer
between 1 and $B$, and in which each heuristic is run for total time at most $B$ (where this
time may be spread across multiple runs). We use $\mathcal{S}_B$ to denote the set of such schedules.

We first consider the online greedy algorithm OGp, defined in §2.4.4. Recall that this
algorithm OGp can be run in a priced feedback model in which, to receive access to the
function $p_x$ after solving instance $x$, we must pay a price that is then added to the regret.
If all heuristics are deterministic, then after running each heuristic on instance $x$ with a
time limit of $B$, we are able to determine the value of $p_x(S)$ for any $S \in \mathcal{S}_B$. If one
or more heuristics are randomized, then after running each randomized heuristic for time
$O(B \log B)$, we can obtain an unbiased estimate of $p_x(S)$ for any $S \in \mathcal{S}_B$, as described
in §3.5.2. As far as the analysis of OGp is concerned, unbiased estimates of $p_x(S)$ are as
good as the true values. Thus, we obtain the following bound as a corollary of Theorem 11.
Here, as in §3.6.1, we use $F$ to denote the maximum CPU time required by the estimation
procedure described in §3.5.2.

Corollary 6. Algorithm OGp, run with WMR as the subroutine experts algorithm, has
4-regret $O\left( (Bk \ln n)^{5/3} (F \ln (Bk))^{1/3} n^{2/3} \right)$ with respect to the set $\mathcal{S}_0 = \mathcal{S}_B$.

We next consider the partially transparent feedback model. Recall that in this model,
after selecting a schedule $S_i$ to use on instance $x_i$, one learns the value of $p_{x_i}(S_{i\langle t \rangle})$ for
all $t \ge 0$. If all heuristics are deterministic, then the information revealed in this model is
always available: $p_{x_i}(S_{i\langle t \rangle})$ equals 1 if $t \ge T(S_i, x_i)$, and equals 0 otherwise. If some
heuristics are randomized, then knowing $T(S_i, x_i)$ does not reveal the exact value of
$p_{x_i}(S_{i\langle t \rangle})$, but for any $t \ge 0$ it provides an unbiased estimate. Again, for the purposes
of the analysis given in §2.4.4 for the partially transparent feedback model, these
unbiased estimates are as good as the true values. Thus, we obtain the following bound as a
corollary of Theorem 12.

Corollary 7. Algorithm OG, run with Exp3 as the subroutine experts algorithm, has
4-regret $O\left((Bk \ln n)^2 \sqrt{nBk \ln Bk}\right)$ with respect to the set of schedules $S_B$.

Given  these  two  corollaries,  it  is  natural  to  consider  hybrid  algorithms  that  always 
make  use  of  the  feedback  provided  in  the  partially  transparent  model,  and  occasionally 
pay  the  exploration  cost  F  in  order  to  obtain  the  information  made  available  in  the  priced 
feedback  model.  We  have  analyzed  such  algorithms  and  consider  them  promising  from 
a  practical  point  of  view.  However,  they  do  not  yield  regret  bounds  that  improve  on  the 
minimum  of  the  two  bounds  just  stated  (as  a  function  of  n,  k,  B ,  and  F)  by  more  than  a 
constant  factor. 


3.7  Handling  Optimization  Problems 

In  this  section  we  describe  how  the  results  of  this  chapter  can  be  applied  to  optimization 
problems,  as  opposed  to  decision  problems.  In  this  context,  we  will  assume  that  instead 
of simply returning a “yes” or “no” answer, our heuristics are anytime algorithms that
return  solutions  of  increasing  quality  over  time.  When  constructing  a  schedule  to  use  in 
combining  such  heuristics,  the  “cost”  of  a  schedule  should  depend  on  how  solution  quality 
changes  as  a  function  of  time  (and  should  not,  for  example,  depend  only  on  the  time 
required  to  find  a provably  optimal  solution). 

To better understand the motivation for the results in this section, consider Figure 3.3,
which depicts the behavior of four solvers on an instance from the 2006 pseudo-Boolean
evaluation (pseudo-Boolean optimization is another name for zero-one integer programming).
On this instance, bsolo is the first solver to generate a feasible solution, while
MiniSat 1.14 is the first to find an optimal solution and also the first to prove optimality.
On the other hand, SAT4JHeur generates a near-optimal solution very quickly but
is unable to prove optimality (within the half-hour time limit). By combining such heuristics
using a task-switching schedule, we might hope to take advantage of their different
strengths.




Figure  3.3:  Behavior  of  four  solvers  on  instance  “normalized-mps-v2-20-10-lseu.opb” 
from  the  2006  pseudo-Boolean  evaluation. 


We now describe a simple way to extend the results of this chapter to cost functions that
account for solution quality. Let us define, for each instance, a set of objectives to achieve.
For example, we might want to find a feasible solution to an optimization problem, and
also to find a (provably) optimal solution. Then, for each instance $x \in X$, we create a new
set of fictitious instances $x_1, x_2, \ldots, x_k$, one for each of the $k$ objectives. For each heuristic
$h \in H$, we define $T(h, x_i)$ to be the time that $h$ requires to achieve the $i$th objective. Thus,
the average time a schedule or heuristic takes to “solve” the fictitious instances is simply
the average time it takes to achieve each of the $k$ objectives on the original instances.
If some objectives are more important than others, we can assign different weights to the
fictitious instances corresponding to each different objective. All the results of this chapter
readily extend to weighted sets of instances.
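
The construction of fictitious instances can be illustrated with a small sketch. The code below is not taken from the thesis; the input layout (`times[h][x]` as a list with one completion time per objective) and the function names are assumptions made for illustration only. It expands each instance into one fictitious instance per objective and computes the weighted average time a heuristic needs to "solve" them.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: times[h][x][i] is the time heuristic h needs to achieve
# objective i on instance x (all values are assumed to be defined here).
Times = Dict[str, Dict[str, List[float]]]
Key = Tuple[str, int]  # (original instance, objective index) = fictitious instance


def expand_objectives(times: Times, weights: List[float]):
    """Create one fictitious instance per (instance, objective) pair."""
    fict_times: Dict[str, Dict[Key, float]] = {}
    fict_weights: Dict[Key, float] = {}
    for h, per_instance in times.items():
        fict_times[h] = {}
        for x, per_objective in per_instance.items():
            for i, t in enumerate(per_objective):
                fict_times[h][(x, i)] = t
                # Every fictitious instance for objective i shares weight i.
                fict_weights[(x, i)] = weights[i]
    return fict_times, fict_weights


def weighted_average_time(fict_times_h: Dict[Key, float],
                          fict_weights: Dict[Key, float]) -> float:
    """Weighted average time to 'solve' the fictitious instances."""
    total_weight = sum(fict_weights.values())
    return sum(fict_weights[key] * fict_times_h[key]
               for key in fict_weights) / total_weight


# Two objectives per instance: (0) find a feasible solution, (1) prove optimality.
times = {"h1": {"x": [2.0, 10.0]}, "h2": {"x": [4.0, 6.0]}}
fict_times, fict_weights = expand_objectives(times, weights=[1.0, 3.0])
avg_h1 = weighted_average_time(fict_times["h1"], fict_weights)
avg_h2 = weighted_average_time(fict_times["h2"], fict_weights)
```

With the optimality objective weighted three times as heavily as feasibility, the heuristic that proves optimality sooner has the lower weighted average even though it is slower to find a first feasible solution.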

We evaluate this approach experimentally in §3.9.5.


3.8 Exploiting Features of Instances


So  far  in  this  chapter  we  have  imagined  that  we  must  select  a  schedule  to  use  in  solving 
instance  x  based  only  on  our  experience  with  previously-encountered  instances,  and  not 
on  any  properties  of  the  instance  x  itself.  In  practice,  there  may  be  quickly-computable 
features that distinguish one instance from another and suggest the use of different heuristics.


In this section, we describe how existing techniques for solving the so-called sleeping
experts problem can be used to exploit features of instances in an attractive way. We
will suppose that each problem instance is labeled with the values of M Boolean features.
Given an instance, we may examine the values of the features (at zero cost) before selecting
a schedule. Roughly speaking, the sleeping experts algorithms allow us to create variants
of our online algorithms such that, even if regret is calculated using only the instances
for which a particular feature is true, our online algorithm’s usual regret bounds will still
hold (this is true simultaneously for all features). For example, if each instance is labeled
as either “random” or “industrial” and also labeled as either “small” or “large”, then our
online algorithm’s usual regret bounds will hold, and they will also hold even if, when
calculating regret, we only consider the “random” instances or only consider the “large”
instances.


Background:  the  sleeping  experts  problem 

The problem of combining different sources of expert advice was discussed in §2.4. Recall
that in this problem, one has access to a set of $M$ experts, each of whom gives out a piece
of advice every day. On each day $i$, one must select an expert $e_i$ whose advice to follow.
Following the advice of expert $j$ on day $i$ yields a reward $r_i^j$. At the end of day $i$, the value
of the reward $r_i^j$ for each expert $j$ is made public, and can be used as the basis for making
choices on subsequent days. One’s regret at the end of $n$ days is equal to

$$\max_{1 \le j \le M} \sum_{i=1}^{n} \left( r_i^j - r_i^{e_i} \right). \qquad (3.11)$$


Note that the rewards assigned to an expert on the first $i - 1$ days do not necessarily
imply anything about its reward on day $i$. Nevertheless, the randomized weighted majority
algorithm [60] can be used to achieve worst-case regret $O(\sqrt{n \ln M})$. Thus, as $n \to \infty$,
the randomized weighted majority algorithm’s average reward approaches (or exceeds) the
maximum reward that could be received by following the advice of a single expert on all
n days.
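
To make the notion of regret concrete, the following sketch computes the regret of a sequence of choices against the best single expert, and pairs it with a simple exponential-weights rule for picking experts. The rule is written in the spirit of the randomized weighted majority algorithm but is not claimed to match the algorithm of [60] in its details; the function names, the learning rate `eta`, and the reward layout `rewards[i][j]` are illustrative assumptions.

```python
import math
import random
from typing import List


def external_regret(rewards: List[List[float]], choices: List[int]) -> float:
    """Best fixed expert's total reward minus the reward actually earned.

    rewards[i][j] is the reward of expert j on day i; choices[i] is the
    expert followed on day i.
    """
    num_experts = len(rewards[0])
    best_fixed = max(sum(day[j] for day in rewards) for j in range(num_experts))
    earned = sum(day[e] for day, e in zip(rewards, choices))
    return best_fixed - earned


def exponential_weights(rewards: List[List[float]], eta: float,
                        rng: random.Random) -> List[int]:
    """Pick an expert each day with probability proportional to exp(eta * past reward)."""
    num_experts = len(rewards[0])
    log_w = [0.0] * num_experts
    choices = []
    for day in rewards:
        top = max(log_w)  # subtract the maximum for numerical stability
        probs = [math.exp(lw - top) for lw in log_w]
        choices.append(rng.choices(range(num_experts), weights=probs)[0])
        # Full-information feedback: every expert's reward is revealed.
        for j in range(num_experts):
            log_w[j] += eta * day[j]
    return choices


rewards = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
regret_always_second = external_regret(rewards, [1, 1, 1])
regret_always_first = external_regret(rewards, [0, 0, 0])
sampled = exponential_weights(rewards, eta=0.5, rng=random.Random(0))
```

Always following the second expert earns 1 while the best fixed expert earns 2, so the regret is 1; following the best fixed expert throughout gives regret 0.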

Now suppose that, on any particular day $i$, a given expert $j$ may abstain from making a
prediction (in this case $r_i^j = 0$). When choosing an expert on day $i$, one must pick an expert
that made a prediction on that day (assume that at least one expert makes a prediction every
day). Let $w_i^j = 1$ if expert $j$ makes a prediction on day $i$; otherwise let $w_i^j = 0$. Define
the “j-regret” of an experts algorithm at the end of $n$ days in analogy to (3.11), but only
considering the days on which expert $j$ made a prediction:


max 

1  <1<M 


E 

i—  1 


wjre. 

J 


(3.12) 


Generalizing previous work by Freund et al. [29], Blum and Mansour [12] presented an
algorithm for selecting experts in this setting whose j-regret is $O(\sqrt{n \log M} + \log M)$,
simultaneously for each $j$. In fact, the algorithm of Blum and Mansour is more general
in that it allows for real-valued weights $w_i^j \in [0, 1]$, as opposed to the binary weights
$w_i^j \in \{0, 1\}$ allowed in the original sleeping experts setting.


Exploiting  features  when  selecting  schedules 


Our online schedule-selection algorithms can be combined with sleeping experts algorithms
in a natural way. For each feature $j$, we create a copy $A_j$ of the online schedule-selection
algorithm that is only used for instances where feature $j$ is true. We then treat
each $A_j$ as a sleeping expert whose loss equals the cost of the schedule that $A_j$ selects
(note that the sleeping experts problem can be defined equally well in terms of minimizing
loss, rather than maximizing reward).

The code for algorithm SE illustrates this approach, which is well-known. As defined
in this code, SE can only be run in the full-information feedback model, where after
solving instance $x_i$ we receive complete knowledge of the function $p_{x_i}$. However, as we
discuss later in this section, SE can be modified to run in the priced feedback model (where
receiving access to $p_{x_i}$ entails paying a cost).

Algorithm SE

Input: sleeping experts algorithm $\mathcal{E}$, online schedule-selection algorithms
$A_1, A_2, \ldots, A_M$.

For $i$ from 1 to $n$:

1. For each feature $j$ that is true for instance $x_i$, use $A_j$ to select a
schedule $S_{i,j}$.

2. Use $\mathcal{E}$ to select a sleeping expert $j$, and select the schedule $S_i = S_{i,j}$.

3. Receive access to the function $p_{x_i}$.

4. For each feature $j$ that is true for instance $x_i$:

(a) feed back the function $p_{x_i}$ to $A_j$, and

(b) feed back the cost of schedule $S_{i,j}$ to $\mathcal{E}$ as the loss it would
have incurred had it selected sleeping expert $j$.
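
The control flow of SE can be sketched in a few lines of Python. This is an illustrative sketch only: the `select`/`feedback`/`update` interfaces, the `cost` callback, and the class `AwakeExpWeights` are assumptions introduced here, and `AwakeExpWeights` is a simple stand-in (exponential weights renormalized over the awake experts), not the algorithm of Blum and Mansour [12] and without its guarantees.

```python
import math
import random
from typing import Any, Callable, Dict, List


class AwakeExpWeights:
    """Stand-in sleeping-experts rule; NOT the Blum-Mansour algorithm."""

    def __init__(self, num_experts: int, eta: float = 0.5, seed: int = 0):
        self.log_w = [0.0] * num_experts
        self.eta = eta
        self.rng = random.Random(seed)

    def select(self, awake: List[int]) -> int:
        # Sample among awake experts only, proportionally to their weights.
        top = max(self.log_w[j] for j in awake)
        probs = [math.exp(self.log_w[j] - top) for j in awake]
        return self.rng.choices(awake, weights=probs)[0]

    def update(self, losses: Dict[int, float]) -> None:
        # Only awake experts receive a loss; sleeping experts are untouched.
        for j, loss in losses.items():
            self.log_w[j] -= self.eta * loss


def run_se(feedback_fns: List[Any], features: List[List[int]],
           selectors: List[Any], experts_alg: Any,
           cost: Callable[[Any, Any], float]) -> List[Any]:
    """Run SE in the full-information model.

    feedback_fns[i] plays the role of p_{x_i}; features[i] lists the features
    that are true for x_i (assumed non-empty); selectors[j] is the copy A_j,
    assumed to offer select() and feedback(p).
    """
    chosen = []
    for p, awake in zip(feedback_fns, features):
        proposals = {j: selectors[j].select() for j in awake}  # step 1
        expert = experts_alg.select(awake)                     # step 2
        chosen.append(proposals[expert])
        # Step 3: p is now revealed. Step 4: feed it back.
        for j in awake:
            selectors[j].feedback(p)                           # step 4(a)
        experts_alg.update({j: cost(proposals[j], p) for j in awake})  # 4(b)
    return chosen
```

Any object exposing the same three methods could be substituted for the stand-in; in particular, the guarantee stated for SE in the text relies on the experts algorithm of [12], not on the rule used here.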


The performance of SE is summarized by the following fact, which follows immediately
from the definition of α-regret together with the regret bound of the sleeping experts
algorithm of Blum and Mansour [12].


Fact 2. Consider running SE with the sleeping experts algorithm of Blum and Mansour
[12] as input, and with a set of subroutine schedule-selection algorithms that each have
worst-case expected α-regret $\Delta$ on a sequence of $n$ instances. Then, simultaneously for
each feature $j$, it holds that SE has worst-case α-regret at most $\Delta + O(\sqrt{n \log M} + \log M)$
if regret is calculated using only the instances for which feature $j$ is true.


As defined, SE receives complete knowledge of the function $p_{x_i}$ after selecting schedule
$S_i$, and then uses this function to give feedback to each $A_j$. Alternatively, SE can be
run in the priced feedback model by only performing this step with some small exploration
probability $\gamma$, as described in §3.6.3. In this setting, Fact 2 can be used in conjunction with
Corollary 6 and Lemma 5 to obtain bounds on 4-regret that hold (simultaneously for all $j$)
when regret is calculated using only the instances that have feature $j$.

Table 3.2: Solver competitions.

Competition     Venue         Domain
CASC-J3         CADE 2007     theorem proving
SMT-COMP’07     CAV’07        satisfiability modulo theories
SAT 2007        SAT 2007      Boolean satisfiability
MaxSAT-2007     SAT 2007      maximum satisfiability
PB’07           SAT 2007      zero-one integer programming
QBFEVAL’07      SAT 2007      quantified Boolean formulae
CPAI’06         CP 2006       constraint satisfaction
IPC-5           ICAPS 2006    A.I. planning

3.9  Experimental  Evaluation 

In  this  section  we  present  an  experimental  evaluation  of  the  offline  and  online  algorithms 
described  in  this  chapter.  The  bulk  of  our  experimental  evaluation  consists  of  using  data 
from recent solver competitions to determine how task-switching schedules constructed by
our offline and online algorithms would have fared had they been entered in the competitions.
In these experiments we consider both optimization and decision problems, and we
exploit  features  of  problem  instances  in  order  to  improve  performance.  At  the  end  of  this 
section,  we  present  experiments  in  which  we  construct  restart  schedules  for  a  randomized 
heuristic. 


3.9.1  Solver  competitions 

Each  year,  various  computer  science  conferences  hold  solver  competitions  designed  to 
assess  the  state  of  the  art  in  some  problem  domain.  In  these  competitions,  each  submitted 
solver  is  run  on  a  sequence  of  problem  instances,  subject  to  some  per-instance  time  limit. 
Solvers  are  awarded  points  based  on  the  instances  they  solve,  and  prizes  are  awarded  to 
the  highest-scoring  solvers.  Many  competitions  are  divided  into  tracks  corresponding  to 
different  categories  of  instances. 

In  this  section  we  describe  experiments  performed  using  data  from  the  eight  solver 
competitions  listed  in  Table  3.2.  Each  of  these  competitions  is  held  either  annually  or 
bi-annually,  and  the  competitions  listed  in  Table  3.2  are  the  most  recent  competitions  that 
had taken place at the time of writing. We now provide a brief description of each of the
competitions and problem domains.


1. CASC-J3. Theorem proving is the task of finding a proof that a given theorem
follows from a given set of axioms, or refuting the theorem. The annual CASC theorem
prover competition evaluates the performance of theorem provers over various
logics.

2. SMT-COMP’07. Satisfiability modulo theories is the task of determining whether
a logical formula is true with respect to a background theory expressed in classical
first-order logic with equality. SMT solvers are generally used to solve hardware
and software verification problems, where typical background theories include the
theories of real and integer arithmetic, and the theories of various data structures
such as arrays and fixed-size bit vectors.

3. SAT 2007. Boolean satisfiability is the task of determining whether there exists an
assignment of truth values to a set of Boolean variables that satisfies each clause
(disjunction) in a set of clauses. SAT solvers are used as subroutines in state-of-the-art
algorithms for hardware and software verification and A.I. planning. The SAT
2007 competition included industrial, random, and hand-crafted benchmarks.

4. Max-SAT 2007. Maximum satisfiability is the optimization problem of finding an
assignment of truth values to a set of Boolean variables that maximizes the number
of satisfied clauses in a given set of clauses. The 2007 Max-SAT evaluation
contained weighted and unweighted Max-SAT instances that encoded various optimization
problems, including graph-theoretic problems and constraint satisfaction
problems.

5. PB’07. Pseudo-Boolean optimization is the task of minimizing a function of zero-one
variables subject to algebraic constraints, also known as zero-one integer programming.
On many benchmarks, pseudo-Boolean optimizers (which are usually
based on SAT solvers) outperform general integer programming packages such as
CPLEX [2]. The PB’07 evaluation included both optimization and decision (feasibility)
problems from a large number of domains, including formal verification and
logic synthesis, as well as various numerical and graph-theoretic problems.

6. QBFEVAL’07. Determining whether a quantified Boolean formula (QBF) is true or
false is the canonical PSPACE-complete problem. The 2007 QBF solver evaluation
included instances derived from A.I. planning and formal verification problems, as
well as various other problems.

7. CPAI’06. Constraint satisfaction problems entail finding an assignment of values to
a set of discrete variables so as to satisfy a set of arbitrary discrete constraints. In the
decision version of the problem, the goal is to determine whether there exists an assignment
that satisfies all the constraints, whereas in the optimization version of the
problem, the goal is to find an assignment that satisfies as many constraints as possible.
The CPAI’06 competition included both decision and optimization problems,
the bulk of which were generated randomly from various distributions.

8. IPC-5. A.I. planning is the problem of finding a sequence of actions (called a plan)
that leads from a starting state to a desired goal state, according to some formally-specified
model of how actions affect the state of the world. The makespan of a
plan is the number of steps in the plan, treating actions that can be performed simultaneously
as a single step. In the optimal planning track of IPC-5, the model
of the world is specified in the STRIPS language and the goal is to find a plan with
(provably) minimum makespan. The optimal planning benchmarks of IPC-5 require
solving tasks such as finding a sequence of biochemical reactions that produce a
desired set of substances, moving packages between locations subject to time and
spatial constraints, and scheduling manufacturing operations.

3.9.2  Experimental  procedures 

Our  experiments  for  each  solver  competition  followed  a  common  procedure: 

1.  We  determined  the  value  of  T  (h,x)  for  each  heuristic  h  and  benchmark  instance 
x  using  data  available  on  the  competition  web  site  (we  did  not  actually  run  any  of 
the  solvers).  Note  that  the  solvers  considered  in  these  competitions  are  deterministic 
(or  randomized,  but  run  with  a  fixed  random  seed),  so  T  ( h ,  x)  is  simply  a  single 
numeric value. For optimization problems, we define T (h, x) to be the time required
to  obtain  a  provably  optimal  solution  (or  to  prove  that  the  problem  is  infeasible).  If 
a  solver  did  not  finish  within  the  competition  time  limit,  then  T  ( h ,  x)  is  undefined. 

2.  We  discarded  any  instances  that  none  of  the  solvers  could  solve  within  the  time  limit. 
(Clearly,  no  task-switching  schedule  could  solve  any  such  instance  within  the  time 
limit  either.) 

Given a schedule $S$ and instance $x$, we will not generally be able to determine the true
value of $T(S, x)$, due to the fact that $T(h, x)$ is undefined for some heuristics $h \in H$.
Instead, we will measure the performance of $S$ on $x$ in terms of upper and lower bounds
on the true value of $T(S, x)$, computed as follows. The lower bound is simply the value

$$\min\{B, T(S, x)\}$$

where $B$ is the competition time limit. Given knowledge of $\min\{B, T(h, x)\}$ for each
$h \in H$, we can determine the value of $\min\{B, T(S, x)\}$ exactly. The upper bound is the
value of $T(S, x)$, computed after artificially setting $T(h, x) = \infty$ for heuristics that did
not solve $x$ within the competition time limit.
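
The two bounds can be computed with a short routine. The sketch below is illustrative rather than the code used for the experiments: it assumes deterministic heuristics, the suspend-and-resume model, a schedule given as a list of (heuristic, duration) pairs, and that a heuristic recorded as unsolved needs more than B time units. The names `completion_time` and `bounds` are hypothetical.

```python
import math
from typing import Dict, List, Optional, Tuple

Schedule = List[Tuple[str, float]]  # (heuristic, duration) pairs, in order


def completion_time(schedule: Schedule, run_times: Dict[str, float]) -> float:
    """Elapsed time at which the schedule first solves the instance.

    Suspend-and-resume model: a heuristic finishes once the total time
    invested in it reaches run_times[h]. Returns math.inf if the schedule
    ends before any heuristic finishes.
    """
    elapsed = 0.0
    invested: Dict[str, float] = {}
    for h, duration in schedule:
        done = invested.get(h, 0.0)
        remaining = run_times[h] - done
        if remaining <= duration:
            return elapsed + remaining
        invested[h] = done + duration
        elapsed += duration
    return math.inf


def bounds(schedule: Schedule, observed: Dict[str, Optional[float]],
           B: float) -> Tuple[float, float]:
    """Lower and upper bounds on T(S, x) from censored competition data.

    observed[h] is T(h, x) if h solved x within the time limit B, else None.
    """
    # Upper bound: pretend unsolved heuristics never finish.
    run_times = {h: (math.inf if t is None else t) for h, t in observed.items()}
    upper = completion_time(schedule, run_times)
    # Lower bound min{B, T(S, x)}: within elapsed time B no heuristic has
    # been run for more than B, so censored heuristics cannot change it.
    lower = min(B, upper)
    return lower, upper


schedule = [("a", 2.0), ("b", 3.0), ("a", 5.0)]
solved = bounds(schedule, {"a": 4.0, "b": None}, B=100.0)
censored = bounds(schedule, {"a": 4.0, "b": None}, B=5.0)
unsolved = bounds(schedule, {"a": None, "b": None}, B=5.0)
```

In the first call, heuristic a is suspended after 2 time units, b runs for 3 without success, and a then finishes after 2 more units of work, so both bounds equal 7. With a time limit of 5 the lower bound is capped at the limit.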

When evaluating offline schedule selection algorithms such as the greedy approximation
algorithm from §3.4, we are concerned about the possibility of overfitting the
solver competition data. To address this possibility, we use leave-one-out cross-validation.
Leave-one-out cross-validation is performed as follows: for each instance $x$, we remove
$x$ from the data set and then run the offline algorithm on the remaining data to obtain a
schedule to use in solving $x$. We then measure the average performance of these schedules
over all $x \in X$.
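
The cross-validation loop itself is generic, and a minimal sketch is given below. The callables `build_schedule` (standing in for the offline algorithm) and `evaluate` (standing in for the performance measure) are placeholders assumed for illustration; they are not interfaces defined in this chapter.

```python
from typing import Any, Callable, List, Sequence


def leave_one_out(instances: Sequence[Any],
                  build_schedule: Callable[[List[Any]], Any],
                  evaluate: Callable[[Any, Any], float]) -> float:
    """Average held-out performance under leave-one-out cross-validation."""
    scores = []
    for i, x in enumerate(instances):
        # Train on every instance except x, then test on x alone.
        training = list(instances[:i]) + list(instances[i + 1:])
        schedule = build_schedule(training)
        scores.append(evaluate(schedule, x))
    return sum(scores) / len(scores)


# Toy stand-ins: the "schedule" is the mean of the training values and the
# "cost" is the absolute error on the held-out value.
toy_instances = [1.0, 2.0, 6.0]
toy_score = leave_one_out(
    toy_instances,
    build_schedule=lambda train: sum(train) / len(train),
    evaluate=lambda schedule, x: abs(schedule - x),
)
```

Because the held-out instance never influences the schedule it is evaluated on, the resulting average is not inflated by fitting to that instance.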

In these experiments, we consider schedules that execute all heuristics in the suspend-and-resume
model, as well as schedules that execute all heuristics in the restart model
(when  the  model  is  not  mentioned  explicitly,  we  use  the  suspend-and-resume  model).  Note 
that  even  if  all  available  heuristics  are  deterministic,  the  restart  model  may  be  useful  due 
to  memory  limitations.  Recall  from  §3.4.2  that  the  greedy  approximation  algorithm  can  be 
used  to  produce  a  schedule  optimized  for  execution  under  either  model.  Also  recall  that, 
by  Remark  1,  there  is  no  loss  associated  with  the  restart  model  from  the  point  of  view  of 
worst-case  approximation  guarantees. 


3.9.3  Experiments  with  the  shortest  path  algorithm 

In  this  section  we  present  experiments  in  which  the  number  of  heuristics  is  small  enough 
that  we  can  compute  an  optimal  task-switching  schedule  using  the  shortest  path  algorithm 
described in Theorem 20 of §3.4.3. This allows us to evaluate the potential benefits of task-switching
schedules, and also to determine how close the schedules returned by the greedy
approximation  algorithm  are  to  optimality.  The  experiments  described  in  this  section  use 
the  data  from  the  SAT  2005  competition.  We  used  the  2005  data  because  the  data  from  the 
SAT  2007  competition  was  not  yet  available  at  the  time  these  experiments  were  performed 
(the  SAT  competition  was  not  held  in  2006). 

For  each  of  the  three  instance  categories  from  the  SAT  2005  competition  (industrial, 
random, and hand-crafted), we computed an optimal task-switching schedule for the two
solvers that won first prize in the satisfiable and unsatisfiable subsets of that category.5 We
use  only  the  top  two  solvers,  because  we  found  that  computing  an  optimal  task-switching 
schedule  for  three  or  more  solvers  was  too  computationally  expensive  to  be  practical  (for 
the  sets  of  benchmark  instances  considered  here).  When  evaluating  the  top  two  solvers 
within  each  category  we  consider  only  the  instances  that  belong  to  that  category,  and,  as 
already  mentioned,  we  discarded  instances  that  neither  of  the  solvers  could  solve  within 
the  time  limit.6 


Table 3.3: Results for the SAT 2005 competition.

Category (instances)   Solver                  Avg. CPU (s)      Num. solved
                                               [lower, upper]
Industrial (268)       Optimal schedule        [793, 793]        268
                       Greedy schedule         [794, 794]        268
                       Greedy schedule (CV)    [810, 810]        268
                       SatELiteGTI             [958, ∞]          267
                       Parallel schedule       [1222, 1264]      265
                       MiniSat 1.13            [1759, ∞]         250
Random (284)           Optimal schedule        [1015, 1173]      261
                       Greedy schedule         [1015, 1173]      261
                       Greedy schedule (CV)    [1050, 1221]      260
                       Parallel schedule       [1081, 1325]      257
                       ranov                   [2026, ∞]         209
                       kcnfs-2004              [2874, ∞]         167
Hand-crafted (403)     Optimal schedule        [483, 538]        391
                       Greedy schedule         [483, 540]        391
                       Parallel schedule       [542, 643]        388
                       Greedy schedule (CV)    [585, 655]        386
                       Vallst                  [1095, ∞]         343
                       SatELiteGTI             [1214, ∞]         350

5 In the industrial category, the solver SatELiteGTI won first prize for both the satisfiable and
unsatisfiable subsets, so we instead combined it with one of the second-place solvers.

6 The time limit for the second stage of the SAT 2005 competition was 200 minutes for industrial instances
and 100 minutes for random and hand-crafted instances.

Table 3.3 displays the upper and lower bounds on average CPU time, as well as the
number  of  instances  solved  within  the  competition  time  limit,  for  various  solvers:  the  top 
two  solvers  in  each  category,  a  schedule  that  simply  runs  these  solvers  in  parallel,  the 
optimal  task-switching  schedule,  and  the  task-switching  schedule  returned  by  the  greedy 
approximation  algorithm.  (As  described  in  §3.9.2,  we  cannot  determine  the  average  CPU 
time  required  by  each  solver  and  schedule  exactly,  but  we  can  compute  upper  and  lower 
bounds.) 

In  all  three  categories,  the  optimal  schedule  improves  on  the  original  solvers  in  terms 
of  average  CPU  time  and  in  terms  of  the  number  of  instances  solved  within  the  time  limit. 
In  terms  of  the  upper  bound  on  average  CPU  time,  the  improvement  is  unbounded.  In 
terms  of  the  lower  bound  on  average  CPU  time,  the  improvement  is  by  factors  of  1.21, 
2.00,  and  2.27  for  the  industrial,  random,  and  hand-crafted  categories,  respectively.  The 
optimal  schedule  also  improves  on  the  naive  parallel  schedule,  both  in  terms  of  average 
CPU  time  and  in  terms  of  the  number  of  instances  solved  within  the  time  limit.  Another 
interesting  feature  of  Table  3.3  is  that  the  schedules  returned  by  the  greedy  approximation 
algorithm  are  very  close  to  optimal  (the  average  CPU  time  is  within  0.2%  of  optimal  in  all 
three  cases). 
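
As a quick arithmetic check, the improvement factors quoted above can be recomputed from the lower bounds reported in Table 3.3. The snippet below only restates numbers from that table; the variable names are illustrative, and the comparison of greedy against optimal is made on the lower bounds.

```python
# Lower bounds on average CPU time (seconds), copied from Table 3.3.
optimal_lower = {"industrial": 793, "random": 1015, "hand-crafted": 483}
greedy_lower = {"industrial": 794, "random": 1015, "hand-crafted": 483}
# Better of the two original solvers in each category (by lower bound).
best_solver_lower = {"industrial": 958, "random": 2026, "hand-crafted": 1095}

# Factor by which the optimal schedule improves on the better original solver.
improvement = {
    category: round(best_solver_lower[category] / optimal_lower[category], 2)
    for category in optimal_lower
}

# Percentage by which the greedy schedule's lower bound exceeds the optimum.
greedy_gap_percent = {
    category: 100.0 * (greedy_lower[category] - optimal_lower[category])
    / optimal_lower[category]
    for category in optimal_lower
}
```

The ratios reproduce the factors 1.21, 2.00, and 2.27 stated in the text, and the largest gap between the greedy and optimal lower bounds (industrial category) is roughly 0.13%.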

Some of these performance improvements are not surprising. In the random category,
one of the two solvers (kcnfs-2004) is a complete solver, whereas the other (ranov)
is based on local search and can only solve satisfiable formulae (in the other two categories,
both solvers are complete). It thus seems natural that hybridizing ranov and
kcnfs-2004 could yield improved performance on a mixture of satisfiable and unsatisfiable
instances. What is perhaps surprising is that the performance can be improved
simply by interleaving the execution of the two solvers according to an appropriate schedule.


[Figure 3.4 plot: one row per solver (kcnfs-2004, ranov); horizontal axis: time (s), logarithmic scale from 1 to 10000.]

Figure 3.4: The optimal task-switching schedule for interleaving kcnfs-2004 and
ranov, the top two solvers in the random instance category.

Figure 3.4 illustrates the optimal task-switching schedule for interleaving the top two
solvers in the random instance category. As illustrated in the figure, the optimal schedule
first runs ranov for about 10 seconds, then runs kcnfs-2004 for about 7 seconds (note the
logarithmic scale), and so on.

To address the possibility of overfitting, we repeated the experiments with the greedy
approximation algorithm using leave-one-out cross-validation, as described in §3.9.2. Under
cross-validation, the lower bound on the greedy schedule’s average CPU time increased
by about 2% in the industrial category, and by about 3.5% and 21% in the random and
hand-crafted categories, respectively.

3.9.4  Experiments  with  a  larger  number  of  heuristics 

In the previous section we saw that the performance of the greedy approximation algorithm
was very close to optimal when using it to combine the top two solvers from the
industrial,  random,  and  hand-crafted  tracks  of  the  SAT  2005  competition.  In  this  section, 
we  present  experiments  involving  a  larger  number  of  heuristics,  using  data  from  the  IPC-5 
A.I.  planning  competition. 

Six planners were entered in the optimal planning track of the Fifth International Planning
Competition (IPC-5). Each planner was run on 240 instances, with a time limit of 30
minutes per instance. On 110 of the instances, at least one of the six planners was able to
find a (provably) optimal plan. As described in §3.9.2, we used the greedy algorithm to
construct an approximately optimal task-switching schedule, given as input the completion
times of each of the six planners on each of these 110 instances.


Table 3.4: Results for the optimal planning track of IPC-5.

Solver                           Avg. CPU (s)      Num. solved
                                 [lower, upper]
Greedy schedule                  [307, 358]        98
Greedy schedule (CV)             [315, 434]        97
Greedy schedule (restart)        [332, 426]        96
Greedy schedule (restart, CV)    [368, 551]        95
Parallel schedule                [456, 1244]       89
SATPLAN                          [507, ∞]          83
Parallel schedule (restart)      [527, 2145]       89
Maxplan                          [641, ∞]          88
MIPS-BDD                         [946, ∞]          54
CPT2                             [969, ∞]          53
FDP                              [1079, ∞]         46
IPPLAN-1SC                       [1437, ∞]         23


Table  3.4  presents  the  results.  In  this  table,  the  schedules  marked  with  “(restart)” 
indicate  schedules  executed  in  the  restart  model,  and  the  schedules  marked  with  “(CV)” 
indicate  the  results  of  leave-one-out  cross-validation.  The  schedule  “Parallel  (restart)” 
indicates  a  schedule  that  runs  each  of  the  k  heuristics  for  T  time  units  each,  starting  with 
T  =  1  and  repeatedly  doubling  T. 
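
The “Parallel (restart)” schedule can be simulated directly. The sketch below assumes deterministic heuristics, assumes the heuristics are run in a fixed order within each round (the text does not specify an order), and caps the number of doublings so that the loop terminates when no heuristic can solve the instance; `parallel_restart_time` and `max_rounds` are hypothetical names.

```python
import math
from typing import List


def parallel_restart_time(run_times: List[float], max_rounds: int = 64) -> float:
    """Time taken by the doubling restart schedule to solve one instance.

    run_times[h] is T(h, x) for each heuristic, in the order they are run
    within a round. In the restart model every run starts from scratch, so a
    run of length T succeeds only if T(h, x) <= T.
    """
    elapsed = 0.0
    T = 1.0
    for _ in range(max_rounds):
        for t in run_times:
            if t <= T:
                # This run succeeds after t time units.
                return elapsed + t
            # The run is cut off after T time units and its work is lost.
            elapsed += T
        T *= 2.0
    return math.inf


two_heuristics = parallel_restart_time([5.0, 3.0])
```

For run times 5 and 3, the rounds with T = 1 and T = 2 fail entirely (6 time units), the first heuristic fails again with T = 4 (4 more units), and the second heuristic then succeeds after 3 units, for a total of 13. The same instance would take 6 time units if both heuristics were simply run in parallel under suspend-and-resume, which illustrates the overhead of restarting.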

As  Table  3.4  shows,  the  greedy  schedules  outperform  the  naive  parallel  schedule  (which 
simply  runs  all  six  planners  in  parallel)  as  well  as  each  of  the  six  individual  planners,  both 
in  terms  of  (lower  and  upper  bounds  on)  average  CPU  time  and  in  terms  of  the  number  of 
instances  solved  within  the  30  minute  time  limit.  Note  that,  in  contrast  to  the  experiments 
in  the  previous  section,  the  greedy  schedules  now  outperform  the  naive  parallel  schedule 
by  a  substantial  factor,  particularly  in  terms  of  the  upper  bound  on  average  CPU  time:  the 
upper  bound  on  the  parallel  schedule’s  average  CPU  time  is  about  3.5  times  that  of  the 
greedy  schedule  in  the  suspend-and-resume  model,  and  about  5  times  that  of  the  greedy 
schedule  in  the  restart  model. 


[Figure 3.5 plot: one row per solver (SATPLAN, Maxplan, MIPS-BDD, CPT2, FDP).]

Figure 3.5: Greedy task-switching schedule for interleaving solvers from IPC-5.

Figure 3.5 shows the task-switching schedule constructed by the greedy approximation
algorithm (the solver IPPLAN-1SC is not shown because it did not appear in the
schedule). As indicated in the figure, the greedy schedule spends the majority of its time
running SATPLAN, the solver that performed best in the competition. However, the first
two solvers that the greedy schedule runs are CPT2 and FDP. Although these solvers did
not perform as well as SATPLAN in the competition, there are some instances that they
are able to solve very quickly, making it beneficial to perform short runs of these solvers
initially.


[Figure: plot with horizontal axis “Time”.]

Figure 3.6: Number of benchmark instances from the IPC-5 A.I. planning competition
solved by various solvers and schedules, as a function of time.


Figure  3.6  shows  the  number  of  instances  solved  by  various  solvers  as  a  function  of 
time:  the  six  individual  solvers,  as  well  as  the  greedy  and  parallel  schedules,  executed  in 
the  suspend-and-resume  model.  As  indicated  in  the  figure,  the  greedy  schedule  not  only 
outperforms  the  other  schedules  and  solvers  in  terms  of  average  CPU  time,  but  outperforms 
them  in  terms  of  the  number  of  instances  solved  within  time  T,  for  almost  all  choices  of 
T.

3.9.5 Experiments with optimization heuristics

As  described  in  §3.7,  the  results  of  this  chapter  can  be  applied  to  optimization  as  well  as 
decision  problems.  Recall  from  §3.7  that  the  idea  of  this  approach  was  to  redefine  the 
“cost”  of  a  schedule  to  reflect  how  solution  quality  changes  as  a  function  of  time,  and  that 
all  our  results  carry  over  to  this  more  general  notion  of  schedule  cost. 

In  this  section,  we  demonstrate  the  power  of  this  idea  using  data  from  the  optimization 
tracks  of  the  2007  pseudo-Boolean  evaluation  and  the  CPAI’06  constraint  programming 
competition.7  Our  experimental  procedure  is  identical  to  the  one  used  in  the  previous 
section,  except  that  we  now  define  the  “cost”  of  a  schedule  to  be  the  average  of  three 
quantities: 

1. the time the schedule takes to find a feasible solution,

2. the time the schedule takes to find an optimal solution, and

3. the time the schedule takes to prove optimality (or to prove that the problem is
infeasible).8

As  discussed  in  §3.9.2,  we  discarded  instances  where  none  of  the  solvers  were  able  to  find 
a  provably  optimal  solution  (for  these  instances,  we  would  not  be  able  to  evaluate  the  last 
two  of  the  three  quantities  just  listed). 
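The redefined cost can be sketched in a few lines of Python. The representation below (a triple of times per instance, with None marking a discarded instance) and the function names are assumptions made for illustration; only the averaging of the three quantities comes from the text.

```python
from typing import List, Optional, Tuple

# (time to first feasible solution, time to first optimal solution,
#  time to proof of optimality), in CPU seconds for one instance.
Milestones = Tuple[float, float, float]


def instance_cost(times: Milestones) -> float:
    """Cost of a schedule on one instance: the mean of the three times."""
    return sum(times) / 3.0


def average_cost(per_instance: List[Optional[Milestones]]) -> float:
    """Mean cost over instances; None marks an instance that was discarded
    because no solver produced a provably optimal solution."""
    kept = [t for t in per_instance if t is not None]
    return sum(instance_cost(t) for t in kept) / len(kept)


# Two retained instances with costs 160 and 10, one discarded instance.
print(average_cost([(30.0, 150.0, 300.0), None, (0.0, 0.0, 30.0)]))  # 85.0
```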

We present detailed results for instances in the “small integers, linear constraints”
category of the PB’07 evaluation, then summarize the results for the other categories and
competitions. In many cases, our experiments yield schedules that outperform each of the
solvers entered in the competition simultaneously in terms of each of the three objectives
just discussed.

Table 3.5 shows the behavior of various solvers in terms of each of these three
objectives, for instances in the “small integers, linear constraints” category of the PB’07
evaluation. As indicated in the table, no one solver is the best in terms of all three
objectives: bsolo3.0.17 is best in terms of the average time required to find an optimal
solution and the average time to prove optimality, while sat4jPseudoCP is about 1.5
times slower at proving optimality but is about 1.6 times faster at finding a feasible
solution.

7 We do not consider optimization problems from the Max-SAT evaluation because the necessary data
showing solver solution quality as a function of time was not collected as part of the Max-SAT evaluation.

8 The time required to find a provably optimal solution is equal to the time required to find an optimal
solution plus the time required to prove that no better solution exists. Note that the time required to prove
optimality is often substantially larger than the time required to simply discover an optimal solution.

Table 3.5: Results for PB’07, optimization problems with
small integers and linear constraints.

Solver                  Avg. CPU (s) to       Avg. CPU (s) to       Avg. CPU (s) to
                        prove optimality      find optimal          find feasible
                        (or infeasibility)    solution              solution

Greedy                  [218, 345]            [164, 243]            [32.54, 65.09]
Greedy (CV)             [251, 738]            [200, 491]            [44.73, 187]
Greedy (restart)        [262, 506]            [190, 326]            [38.79, 79.00]
Greedy (restart, CV)    [295, 1234]           [236, 783]            [53.34, 340]
Parallel                [381, 1360]           [288, 949]            [56.94, 275]
Parallel (restart)      [474, 2537]           [369, 1749]           [87.29, 517]
bsolo3.0.17             [629, ∞]              [577, ∞]              [270, ∞]
bsolo3.0.16             [664, ∞]              [608, ∞]              [270, ∞]
minisat+1.14            [756, ∞]              [714, ∞]              [250, ∞]
Pueblo1.4               [890, ∞]              [816, ∞]              [229, ∞]
sat4jPseudoCP           [961, ∞]              [842, ∞]              [164, ∞]
sat4jPseudoCPCls.       [971, ∞]              [856, ∞]              [167, ∞]
sat4jPseudoRes.         [972, ∞]              [880, ∞]              [270, ∞]
glpPB0.2                [735, ∞]              [735, ∞]              [735, ∞]
PBS4_v2                 [1086, ∞]             [974, ∞]              [280, ∞]
PBS4                    [1086, ∞]             [975, ∞]              [281, ∞]
PB-clasp04-10           [1025, ∞]             [931, ∞]              [410, ∞]
PB-clasp03-23           [1168, ∞]             [1117, ∞]             [633, ∞]
oree0.1.2 alpha         [1431, ∞]             [1360, ∞]             [614, ∞]
absconPseudo102         [1399, ∞]             [1281, ∞]             [812, ∞]
wildcat-skc             [1795, ∞]             [1109, ∞]             [593, ∞]
wildcat-rnp             [1795, ∞]             [1210, ∞]             [702, ∞]

As shown in Table 3.5, the greedy schedule significantly outperforms each of the
solvers entered in the competition, simultaneously in terms of all three objectives. In
fairness, we should also note that simply running all the solvers in parallel also
outperforms each of the original solvers in terms of all three objectives, although by a smaller
margin. However, the parallel schedule is inefficient, in part because many of the solvers
have very similar behavior (e.g., the two versions of bsolo and sat4jPseudoCP).
Accordingly, the performance of the parallel schedule is significantly worse than that of the
greedy schedule (even when evaluated under leave-one-out cross-validation) in terms of
all three objectives, particularly in terms of the upper bounds on average CPU time.

Table 3.6 summarizes the results of all our optimization experiments. For each set
of instances and for each of the three objectives, we define a speedup factor equal to the
(lower bound on) average CPU time required by the fastest individual solver to achieve that
objective, divided by the corresponding quantity for the greedy schedule, where the
performance of the greedy schedule is evaluated under leave-one-out cross-validation. Note
that in general, the three different speedup factors listed for each competition represent a
comparison against three different solvers.
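The speedup-factor definition can be checked against the lower bounds reported in Table 3.5. The sketch below is a minimal worked example in Python; the dictionary holds only a subset of the solvers (chosen to include the fastest one for each objective), and the function name is an illustrative assumption.

```python
# Lower bounds on average CPU time (s) from Table 3.5, in the order
# (prove optimality, find optimal solution, find feasible solution).
greedy_cv = (251.0, 200.0, 44.73)
individual = {
    "bsolo3.0.17": (629.0, 577.0, 270.0),
    "minisat+1.14": (756.0, 714.0, 250.0),
    "Pueblo1.4": (890.0, 816.0, 229.0),
    "sat4jPseudoCP": (961.0, 842.0, 164.0),
}


def speedup_factors(solvers, schedule):
    """Per objective: fastest individual solver's time / schedule's time."""
    factors = []
    for k, schedule_time in enumerate(schedule):
        fastest = min(times[k] for times in solvers.values())
        factors.append(fastest / schedule_time)
    return factors


print(["%.3f" % f for f in speedup_factors(individual, greedy_cv)])
```

The three ratios come out close to 2.5, 2.9, and 3.7, which agrees with the “Opt. small integers” row of Table 3.6 up to the rounding of the table entries. Note that the fastest solver differs by objective (bsolo3.0.17 for the first two, sat4jPseudoCP for the third).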


Table 3.6: Speedup factors for experiments with optimization heuristics.

Competition   Category                               Speedup factor   Speedup factor   Speedup factor
                                                     (proving         (finding         (finding
                                                     optimality)      optimal          feasible
                                                                      solution)        solution)

PB’07         Opt. small integers                    2.50             2.89             3.67
              Opt. small integers, non-linear        1.60             1.30             1.44
              Opt. big integers                      1.18             1.47             1.36
CPAI’06       Opt. binary constraints in extension   1.60             3.31             0.96
              Opt. n-ary constraints in extension    1.13             1.33             0.98

As the table shows, our strongest results were for the PB’07 evaluation: in all three
categories, we were able to generate a schedule that simultaneously outperformed each
of the original solvers in terms of each of the three objectives we considered: average
time to find a feasible solution, average time to find an optimal solution, and average time
required to prove optimality. Our results for the optimization tracks of the CPAI’06
competition are qualitatively similar, though not quite as strong. In each of the two categories
considered in these experiments, we obtain a schedule that simultaneously outperforms
each of the original solvers in terms of the time required to find an optimal solution and
the time required to prove optimality, and simultaneously performs almost as well as the
best individual solver in terms of the time required to find a feasible solution.


3.9.6  Experiments  with  online  algorithms 

In this section we compare the performance of various online schedule-selection
algorithms by using them to combine solvers from the three tracks of the 2007 SAT
competition: industrial, random, and hand-crafted. Within each instance category, we compared
the performance of the online algorithms to the offline greedy schedule, to the individual
solver with the lowest (lower bound on) average CPU time, and to a schedule that ran
each solver in parallel at equal strength. All the schedules considered in these experiments
are executed in the suspend-and-resume model.

We  compare  the  performance  of  four  online  algorithms: 

1. Online greedy (WMR): the online greedy algorithm OG from Chapter 2, run in the
full-information feedback model (i.e., after solving an instance, the times required
by all solvers are revealed). When running in the full-information feedback model,
we use the self-tuning randomized weighted majority algorithm of Auer & Gentile
[6] as the subroutine experts algorithm used by OG.

2. Online greedy (Exp3): the online greedy algorithm OG, run in the opaque feedback
model (i.e., after using a schedule to solve an instance, we only learn the CPU time
required by that schedule to solve that instance). When running in this feedback
model, we use a self-tuning version of the Exp3 algorithm [5] as the subroutine
experts algorithm used by OG. Recall from §3.6.3 that, when all heuristics are
deterministic, the opaque feedback model and the partially transparent feedback model
are equivalent, and thus the information needed to compute the payoffs to Exp3 is
available.

3. WMR: An online algorithm that uses the self-tuning randomized weighted majority
algorithm of Auer & Gentile [6] to select a single heuristic to use in solving each
problem instance. Specifically, we treat each heuristic as an expert whose loss
(negative payoff) equals the time required by the heuristic to solve that instance (capped
at the competition time limit).

4. Exp3: An online algorithm identical to the one just described, except that the
self-tuning version of Exp3 is used as the experts algorithm.
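The single-heuristic WMR baseline (item 3 above) can be sketched as follows. This is a simplified illustration in Python, not the self-tuning algorithm of [6]: it uses a fixed learning rate eta, which is an assumption made here, and the function and parameter names are hypothetical. Losses are run times capped at the time limit and rescaled to [0, 1].

```python
import math
import random
from typing import List, Sequence


def weighted_majority_select(
    run_times: Sequence[Sequence[float]],
    time_limit: float,
    eta: float = 0.5,
    seed: int = 0,
) -> List[int]:
    """Pick one heuristic per instance with randomized weighted majority.

    run_times[t][i] is the time heuristic i needs on instance t. The fixed
    learning rate eta is a simplification of the self-tuning rule.
    """
    rng = random.Random(seed)
    k = len(run_times[0])
    weights = [1.0 / k] * k
    choices = []
    for times in run_times:
        # Sample a heuristic with probability proportional to its weight.
        choices.append(rng.choices(range(k), weights=weights)[0])
        # Full-information update: every heuristic's loss is observed.
        for i, t in enumerate(times):
            loss = min(t, time_limit) / time_limit
            weights[i] *= math.exp(-eta * loss)
        total = sum(weights)
        weights = [w / total for w in weights]  # renormalize against underflow
    return choices
```

With two heuristics that always need 1 and 100 seconds respectively, the weight of the slow heuristic decays geometrically, so after enough instances the sketch selects the fast one essentially always.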
We ran the online greedy algorithm with parameter L = 25. The time units used by the
online greedy algorithm are the competition time limit divided by L. In implementing the
algorithm, we made use of the two modifications described in §2.4.6, namely (i) ruling out
actions that have already been performed when sampling from the distribution returned
by each experts algorithm and (ii) using dependent rather than independent probabilities
when converting the experts algorithms’ choices into a schedule.

Table 3.7 summarizes the results of these experiments. The rows labeled Online
greedy w/features and Offline greedy w/features (CV) refer to experiments described in
the next section in which we make use of instance features.

In  each  category,  the  offline  greedy  schedule  (evaluated  under  leave-one-out  cross- 
validation)  outperforms  each  individual  solver  as  well  as  the  naive  parallel  schedule,  both 
in  terms  of  the  number  of  instances  solved  within  the  time  limit  and  in  terms  of  (upper  and 
lower  bounds  on)  average  CPU  time.  The  same  is  true  of  the  online  greedy  schedule  in  the 
full  information  setting.  Under  the  opaque  feedback  model,  the  performance  of  the  online 
greedy  algorithm  is  not  as  strong.  In  some  cases  it  does  not  outperform  the  best  individual 
solver,  while  in  other  cases  it  does  not  outperform  the  naive  parallel  schedule.  In  the 
section  that  follows,  we  evaluate  how  the  various  online  algorithms  behave  asymptotically, 
as  the  number  of  instances  grows  large. 


Table 3.7: Results for the SAT 2007 competition.

Category (#instances)   Solver                           Avg. CPU (s)      Num. solved
                                                         [lower, upper]

Industrial (166)        Offline greedy w/features (CV)   [1872, 3180]      151
                        Online greedy w/features         [2014, 4336]      149
                        Online greedy (WMR)              [2215, 4196]      149
                        Fastest solver                   [2438, ∞]         139
                        Offline greedy (CV)              [2464, 4271]      148
                        WMR                              [2617, ∞]         139
                        Online greedy (Exp3)             [2765, 6858]      134
                        Parallel                         [3176, 7003]      132
                        Exp3                             [3574, ∞]         120

Random (411)            Offline greedy w/features (CV)   [963, 2204]       380
                        Online greedy w/features         [1044, 3262]      365
                        Online greedy (WMR)              [1304, 4261]      347
                        Offline greedy (CV)              [1337, 3252]      344
                        Parallel                         [1775, 7571]      302
                        Online greedy (Exp3)             [2050, 8127]      294
                        Fastest solver                   [2157, ∞]         252
                        WMR                              [2184, ∞]         255
                        Exp3                             [2835, ∞]         191

Hand-crafted (129)      Offline greedy w/features (CV)   [1237, 2518]      113
                        Offline greedy (CV)              [1344, 2715]      110
                        Online greedy w/features         [1430, 2947]      108
                        Online greedy (WMR)              [1513, 3452]      107
                        Fastest solver                   [1847, ∞]         98
                        Parallel                         [1855, 4866]      95
                        WMR                              [1903, ∞]         96
                        Exp3                             [2012, ∞]         96
                        Online greedy (Exp3)             [2041, 5148]

Asymptotic behavior of online algorithms

To more thoroughly evaluate the behavior of the online algorithms in the limited-feedback
setting, we performed experiments involving a much larger number of benchmark
instances. To do so, we sampled 100,000 benchmark instances independently (with
replacement) from the set of 411 benchmarks from the random category of the SAT 2007
competition. We then ran each of the online algorithms on this sequence of 100,000
benchmarks.

It is worth pointing out that, if we knew that each instance in the sequence was being
drawn independently from a fixed distribution (as is the case in these experiments), we
could design simpler algorithms and achieve better performance (e.g., by using the first m
instances as training data, for some appropriate value of m, and then using the schedule
that performs best on the training instances to solve the remaining instances). However,
the intent of these experiments is to evaluate the behavior of the online algorithms on long
sequences of instances, which will not in general have this property.
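The sampling procedure and the simpler “train on the first m instances” baseline mentioned above can be sketched as follows. This is an illustrative Python sketch only: it assumes a small, precomputed table of costs for a fixed set of candidate schedules, which is a hypothetical setup and not how the experiments were organized.

```python
import random
from typing import List, Sequence


def sample_sequence(num_benchmarks: int, length: int, seed: int = 0) -> List[int]:
    """Draw benchmark indices independently and uniformly, with replacement."""
    rng = random.Random(seed)
    return rng.choices(range(num_benchmarks), k=length)


def best_on_prefix(
    costs: Sequence[Sequence[float]], sequence: Sequence[int], m: int
) -> int:
    """Index of the candidate schedule with the lowest total cost on the
    first m instances of the sequence.

    costs[b][s] is the (hypothetical, precomputed) cost of candidate
    schedule s on benchmark b.
    """
    num_schedules = len(costs[0])
    totals = [
        sum(costs[b][s] for b in sequence[:m]) for s in range(num_schedules)
    ]
    return min(range(num_schedules), key=lambda s: totals[s])


sequence = sample_sequence(num_benchmarks=411, length=100_000)
```

The schedule returned by best_on_prefix would then be used unchanged on the remaining instances, which is only sensible when the instances really are drawn from a fixed distribution.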

In addition to evaluating the online algorithms considered in the previous section, we
evaluated the variant of the online greedy algorithm designed for operation in the priced
feedback model, as described in §3.6.3. Recall that this model works as follows: when
given an instance, the online algorithm may either select a schedule, or may choose to
explore by running all the solvers until they find a solution (or until the competition time limit
expires). As described in §3.6.3, the online greedy algorithm OGp can be applied in this
setting, and simply explores with a fixed probability on each instance. We experimented
with three values of the exploration probability: 0.1, 0.01, and 0.001.


[Figure: plot with horizontal axis “Num. instances”.]

Figure 3.7: Performance of various online algorithms on instances drawn at random from
the set of SAT 2007 benchmark instances in the random category.

Figure 3.7 shows the (running) average CPU time for various online algorithms and
offline schedules as a function of the number of instances encountered. The algorithms
labeled “Online greedy (p = γ)” refer to the online greedy algorithm, run with exploration
probability γ. For each of the three values of γ we display two curves: one for the average
CPU time on the non-exploration rounds, and the other (marked “Online greedy (p = γ)
+ exploration cost”) the overall average CPU time, including the time required to run each
solver on the exploration rounds.
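The two kinds of curves described above can be made concrete with a small accounting sketch in Python. It rests on assumptions that the text does not spell out: an exploration round is charged the sum of every solver's run time (each capped at the time limit), and the overall average divides total CPU time by the number of instances. All names are hypothetical.

```python
import random
from typing import Sequence, Tuple


def priced_feedback_averages(
    schedule_costs: Sequence[float],
    solver_times: Sequence[Sequence[float]],
    time_limit: float,
    explore_prob: float,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return (average cost on non-exploration rounds,
               overall average cost including exploration rounds).

    schedule_costs[t] is the CPU time the selected schedule would use on
    instance t; solver_times[t][i] is solver i's run time on instance t.
    """
    rng = random.Random(seed)
    exploit_total, exploit_rounds, overall_total = 0.0, 0, 0.0
    for cost, times in zip(schedule_costs, solver_times):
        if rng.random() < explore_prob:
            # Exploration round: run every solver, each up to the limit.
            overall_total += sum(min(t, time_limit) for t in times)
        else:
            exploit_total += cost
            exploit_rounds += 1
            overall_total += cost
    avg_exploit = exploit_total / exploit_rounds if exploit_rounds else float("nan")
    return avg_exploit, overall_total / len(schedule_costs)
```

Under these assumptions, a larger exploration probability gives the algorithm more feedback but widens the gap between the two averages, since every exploration round pays for all solvers.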
As Figure 3.7 shows, all the online greedy algorithms eventually outperform both the
best individual solver and the parallel schedule in terms of average CPU time. In the
full-information setting, this happens after a few dozen instances, while in the limited feedback
settings it takes longer. In the limited feedback settings, each of the online greedy
algorithms outperforms the best individual solver after fewer than 1000 instances, while the
number of instances required to overtake the parallel schedule ranges from about 700 (when
the exploration probability is 0.1, and exploration costs are ignored) to 35,000 (when the
exploration probability is 0.001).

Not surprisingly, the online algorithms that select a single solver to run on each
instance do not outperform the best individual solver, although they approach its
performance as the number of instances grows large.

3.9.7 Experiments with instance features

In §3.8, we discussed how instance-specific features may be exploited to make a better
choice of schedule to use in solving a particular instance. In this section, we present
experiments that demonstrate the additional speedups that can be obtained using this
approach.

Recall from §3.8 that in this approach we create, for each feature, a separate copy of
our online schedule-selection algorithm that is only run on instances that have that feature.
We then use a “sleeping experts algorithm” to select among the schedules returned by the
various copies. We use OGse to denote the online algorithm that results from composing
the sleeping experts algorithm of Blum and Mansour [12] with the online greedy algorithm
OG in this way. In other words, OGse is the algorithm SE from §3.8, where the algorithm
of Blum and Mansour [12] is the subroutine sleeping experts algorithm and OG is the
subroutine online schedule selection algorithm.

Features  used 

Selecting a set of informative features for each of the eight problem domains considered in
this chapter would be a challenging research project in itself. For this reason, we take a very
simple and domain-independent approach to feature selection. We made use of two types
of features:

1. Features based on competition benchmark directory structure. We compute a
number of features of each instance x based on the directory in which x is stored.
Specifically, for each directory we create a Boolean feature that is true if and only if the
instance resides somewhere within that directory’s subtree (so if x is stored d
levels deep there will be d features that evaluate to true for x). Note that these
features are potentially quite useful; for example the instances stored in the directory
GRAPHS/WMAXCUT/SPINGLASS seem likely to have common features which
some heuristic might be able to exploit. In an attempt to make our experiments
fair, we manually removed any directory names that we felt would give away too
much information (e.g., we would remove a directory called HARD_INSTANCES).

2.  Features  based  on  instance  annotations.  In  some  cases,  the  instances  contained 
specific  annotations  that  described  the  problem.  Specifically,  the  instances  used  in 
the  CASC-J3  theorem  proving  competition  specified  the  field  of  mathematics  that 
the  theorem  came  from  (e.g.,  general  algebra,  geometry,  theory  of  computation). 

Additionally,  we  include  a  Boolean  feature  that  evaluates  to  true  for  all  instances.  This 
ensures  that  the  online  algorithm  OGse  maintains  regret  bounds  that  hold  for  all  instances, 
in  addition  to  its  per-feature  regret  bounds. 
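The directory-based features, together with the always-true feature, can be sketched in Python as follows. The feature naming scheme, the `<all>` label, the example file name, and the `excluded` parameter (standing in for the manual removal of overly informative directory names) are all illustrative assumptions.

```python
from pathlib import PurePosixPath
from typing import FrozenSet, Set

ALWAYS_TRUE = "<all>"  # hypothetical name for the feature true on every instance


def directory_features(
    instance_path: str, excluded: FrozenSet[str] = frozenset()
) -> Set[str]:
    """Boolean features that are true for an instance stored at a relative
    path: one per enclosing directory, plus the always-true feature."""
    dirs = PurePosixPath(instance_path).parent.parts
    features = {ALWAYS_TRUE}
    for depth in range(1, len(dirs) + 1):
        if dirs[depth - 1] not in excluded:
            # The feature is identified by the path of the enclosing directory.
            features.add("/".join(dirs[:depth]))
    return features


print(sorted(directory_features("GRAPHS/WMAXCUT/SPINGLASS/instance.wcnf")))
```

An instance stored three levels deep gets three directory features (GRAPHS, GRAPHS/WMAXCUT, and GRAPHS/WMAXCUT/SPINGLASS) in addition to the always-true one, matching the “d levels deep, d features” description.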

Note  that  in  general  there  will  be  many  Boolean  features  that  are  true  for  a  particular 
instance,  and  so  the  online  algorithm  must  discover  which  features  to  pay  attention  to. 


Cross-validation 

In addition to evaluating the online algorithm OGse, we perform experiments in which we
evaluate the use of features under leave-one-out cross-validation. Unlike in our previous
experiments, it is not immediately obvious how to perform leave-one-out cross-validation
in the presence of features. One possible approach would be the following: when leaving
out each instance x, run a copy of OGse on the instances one at a time, with instance x
presented last. Unfortunately, the computation time required by this approach is prohibitive
for some solver competitions (some of which have thousands of instances within a single
category). Instead, we adopt a simpler approach that is designed to achieve roughly the
same effect.

Our approach is as follows. For each instance x, we remove x from the set of instances
and use the remaining instances as training data. For each feature j, we use the subset of
training instances that have feature j as input to the offline greedy approximation
algorithm, producing a schedule Sj. Then, for each feature j, we create a “sleeping expert”
that recommends schedule Sj on instances that have feature j. We then run the sleeping
experts algorithm of Blum and Mansour [12] on all n instances, with x presented last, to
obtain a schedule to use in solving x. Note that in the degenerate case where we have only
a single feature and it is true for all instances, this cross-validation procedure is identical
to the one used in previous sections.
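The training half of this procedure (building one schedule Sj per feature from the instances other than x) can be sketched in Python as below. The sleeping experts step is deliberately omitted, and `build_schedule` is a caller-supplied stand-in for the offline greedy approximation algorithm; both the function names and the data representation are assumptions.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Sequence, TypeVar

Instance = TypeVar("Instance")
Schedule = TypeVar("Schedule")


def per_feature_schedules(
    instances: Sequence[Instance],
    features_of: Callable[[Instance], Iterable[Hashable]],
    held_out: Instance,
    build_schedule: Callable[[List[Instance]], Schedule],
) -> Dict[Hashable, Schedule]:
    """Leave out one instance and build a schedule S_j for each feature j
    from the training instances that have feature j."""
    training = [x for x in instances if x != held_out]
    by_feature: Dict[Hashable, List[Instance]] = {}
    for x in training:
        for j in features_of(x):
            by_feature.setdefault(j, []).append(x)
    return {j: build_schedule(subset) for j, subset in by_feature.items()}
```

Each resulting schedule would then back one sleeping expert that is “awake” only on instances with the corresponding feature. A feature that is true only for the held-out instance gets no schedule, since no training instance has it.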


Results 

Table 3.7 (in the previous section) summarizes the results of our experiments for the
industrial, random, and hand-crafted categories of the SAT 2007 competition. The rows labeled
Online greedy w/features refer to the online algorithm OGse. The rows labeled Offline
greedy w/features (CV) refer to the cross-validation procedure just described. As the table
shows, the use of features consistently improves performance, both in the online setting
and under leave-one-out cross-validation. In the random category, for example, using
features improves (the lower bound on) the average CPU time of the offline greedy algorithm
(evaluated under leave-one-out cross-validation) by a factor of 1.43, and improves the
average CPU time of the online greedy algorithm by a factor of 1.25.


Table 3.8: Average CPU time (lower bounds) required by different schedules and
heuristics to solve instances from the random category of the 2007 SAT competition. Bold
numbers indicate the (strictly) smallest value in a row.

Feature (instances)      Best heuristic          Greedy (CV)   Greedy (CV)
                         for feature                           w/features

2+p (105)                1100 (March KS)         918           885
2+p/p0.7 (36)            1607 (SATzilla)         1286          1368
2+p/p0.8 (36)            847 (March KS)          850           728
2+p/p0.9 (33)            678 (March KS)          590           531
LargeSize (130)          2276 (adaptg2wsat+)     2641          1571
LargeSize/3SAT (42)      1016 (gnovelty+)        2588          1016
LargeSize/5SAT (57)      921 (adaptg2wsat+)      2215          1164
LargeSize/7SAT (31)      2532 (ranov)            3496          3069
OnThreshold (176)        764 (SATzilla)          625           435
OnThreshold/3SAT (55)    404 (SATzilla)          606           397
OnThreshold/5SAT (60)    601 (SATzilla)          624           523
OnThreshold/7SAT (61)    1034 (March KS)         643           382

Table 3.8 illustrates in greater detail the power of using features on instances from the
random category. The first column lists the feature along with the number of instances for
which the feature was true. The second column lists, for each feature, the minimum
average CPU time required by any single heuristic, where the average is computed only over
instances that have that feature. The third and fourth columns list the average CPU time
for the greedy schedule (evaluated under cross-validation), without and with the benefit
of features, respectively. Bold numbers indicate the minimum average CPU time within
a row. As the table shows, the use of features substantially improves the performance of
the greedy schedule in many cases. In nine of the twelve cases, the greedy schedule with
features outperforms the best solver for instances that had that feature. In contrast, for
the greedy schedule constructed without the use of features, this is only true in five out of
twelve cases.
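The counts of nine and five can be reproduced directly from the entries of Table 3.8. The short Python check below takes the column order from the table header (Greedy (CV) before Greedy (CV) w/features) and counts only strict improvements over the best single heuristic; the variable names are illustrative.

```python
# (best single heuristic, greedy (CV), greedy (CV) w/features): average CPU
# time lower bounds from Table 3.8, keyed by feature.
rows = {
    "2+p": (1100, 918, 885),
    "2+p/p0.7": (1607, 1286, 1368),
    "2+p/p0.8": (847, 850, 728),
    "2+p/p0.9": (678, 590, 531),
    "LargeSize": (2276, 2641, 1571),
    "LargeSize/3SAT": (1016, 2588, 1016),
    "LargeSize/5SAT": (921, 2215, 1164),
    "LargeSize/7SAT": (2532, 3496, 3069),
    "OnThreshold": (764, 625, 435),
    "OnThreshold/3SAT": (404, 606, 397),
    "OnThreshold/5SAT": (601, 624, 523),
    "OnThreshold/7SAT": (1034, 643, 382),
}

# A "win" means strictly lower average CPU time than the best heuristic.
wins_with_features = sum(1 for best, _, w in rows.values() if w < best)
wins_without_features = sum(1 for best, g, _ in rows.values() if g < best)

print(wins_with_features, wins_without_features)  # 9 5
```

The tie on LargeSize/3SAT (1016 in both columns) is not counted as a win, consistent with the caption's “strictly smallest” convention.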


3.9.8  Summary  of  experimental  evaluation 

In  this  section  we  summarize  our  experimental  results  for  the  eight  solver  competitions 
described  in  §3.9.1.  For  each  category  of  each  competition,  we  compare  the  performance 
of  the  offline  greedy  schedule  (evaluated  under  leave-one-out  cross-validation)  to  that  of 
the  solver  that  performed  best  in  terms  of  average  CPU  time.  We  quantify  the  performance 
improvement  achieved  by  the  greedy  schedule  by  calculating  two  “speedup  factors”:  the 
first  equals  the  ratio  of  (a  lower  bound  on)  the  average  CPU  time  of  the  best  individual 
solver  to  that  of  the  schedule  produced  by  the  greedy  algorithm,  while  the  second  equals 
the  ratio  of  the  median  CPU  time  of  the  best  individual  solver  to  that  of  the  greedy  schedule 
(both  speedup  factors  compare  the  greedy  schedule  to  the  same  individual  solver,  namely 
the  one  that  performed  best  in  terms  of  average  CPU  time).  We  run  the  greedy  algorithm 
both  with  and  without  the  use  of  features,  as  described  in  §3.9.7.  All  CPU  times  for 
the  offline  greedy  algorithm  are  calculated  using  leave-one-out  cross-validation,  to  avoid 
results  that  are  misleading  due  to  overfitting. 
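For the SAT 2007 categories, the mean speedup factors can be recomputed from the lower bounds in Table 3.7, which gives a useful consistency check on Table 3.9. The Python sketch below does this; the function name is an illustrative assumption, and median speedups are not covered because the per-instance times are not given in the text.

```python
# Lower bounds on average CPU time (s) from Table 3.7:
# (fastest solver, offline greedy (CV), offline greedy w/features (CV)).
sat2007 = {
    "Industrial": (2438, 2464, 1872),
    "Random": (2157, 1337, 963),
    "Hand-crafted": (1847, 1344, 1237),
}


def mean_speedups(fastest, greedy, greedy_features):
    """Ratio of the best solver's average CPU time to the schedule's,
    without and with features."""
    return fastest / greedy, fastest / greedy_features


for category, times in sat2007.items():
    plain, with_features = mean_speedups(*times)
    print(f"{category}: {plain:.2f} without features, {with_features:.2f} with features")
```

The resulting pairs (0.99 and 1.30 for industrial, 1.61 and 2.24 for random, 1.37 and 1.49 for hand-crafted) are the mean speedup values that appear in the SAT 2007 rows of Table 3.9.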

Table 3.9 shows the results. In 30 out of 44 cases, the greedy schedule (evaluated
under cross-validation) outperforms the best individual solver in terms of average CPU time,
while in 14 cases, the greedy schedule performs worse than the best individual solver, due
to overfitting. Generally speaking, overfitting occurs for categories in which the number
of instances is relatively small. In terms of average CPU time, the performance
improvements are less than a factor of 10 in all but one case. In terms of median CPU time, the
performance improvements are more dramatic: the greedy schedule outperforms the best
individual solver by more than a factor of 10 in several cases. This difference is not all that
surprising, given that the “best” individual solver was defined as the one with minimum
average CPU time, and not the one with minimum median CPU time.

The use of features usually but not always improved performance. In 30 of the 44 cases
listed in Table 3.9, the use of features led to a larger speedup in average CPU time; in 10
cases it was harmful; and in 4 cases it had no effect. In cases where the use of features
is harmful, the harm is again due to overfitting. Again, overfitting occurs primarily in
cases where the number of instances is relatively small. Again, note that the performance
improvement in terms of median CPU time is generally larger than the improvement in
terms of average CPU time, and is larger than a factor of 10 in several cases.


Table 3.9: Speedup factors for various solver competitions.

Competition    Category (#instances)              Speedup w/features    Speedup
                                                  Mean      Median      Mean      Median

CASC-21        CNF (191)                          1.24      0.97        1.45      1.46
               EPR (98)                           0.58      1.00        0.56      0.98
               FNT (100)                          3.78      2.94        3.47      2.94
               FOF (295)                          2.06      90.0        2.15      90.0
               SAT (100)                          4.83      0.98        5.49      0.98
               UEQ (93)                           0.99      1.00        0.99      1.00

CPAI’06        Binary ext. (1140)                 1.39      1.06        1.37      1.00
               Binary int. (698)                  3.03      2.36        1.97      1.66
               Global (127)                       0.28      1.00        0.28      0.94
               Opt. binary ext. (619)             2.06      1.36        1.57      1.11
               Opt. n-ary ext. (97)               1.55      1.64        1.23      0.94
               N-ary ext. (312)                   1.36      18.6        1.19      18.6
               N-ary int. (736)                   2.60      41.8        2.10      29.5

IPC-5          Optimal planning (110)             1.78      2.89        1.61      2.50

MaxSAT-2007    Max-SAT (790)                      0.99      0.97        0.98      1.00
               Partial Max-SAT (647)              1.68      0.89        1.31      0.94
               Weighted Max-SAT (308)             1.15      1.55        0.82      1.76
               Weighted Partial (702)             1.49      1.58        1.15      0.81

PB’07          Opt. big ints. (124)               1.11      1.04        1.05      0.95
               Opt. small ints. (396)             3.09      6.77        2.71      4.08
               Opt. small ints. non-lin. (280)    2.32      1.01        2.10      0.96
               Pure satisfiability (88)           1.24      1.09        0.98      0.97
               Small ints. (216)                  3.19      69.2        2.56      36.7

QBFEVAL’07     Formal verification (728)          1.91      3.04        1.52      2.36
               Horn clause formulas (287)         1.06      1.00        1.06      1.00
               Miscellanea (67)                   2.19      3.72        2.19      3.72
               Non_prenex_non_cnf (81)            0.81      0.76        0.81      0.79
               Planning (80)                      1.37      0.92        1.28      1.00

SAT 2007       And-Inverter Graphs (263)          1.26      1.00        1.11      1.00
               Hand-crafted (129)                 1.49      3.24        1.37      3.24
               Industrial (166)                   1.30      1.42        0.99      1.18
               Random (411)                       2.24      7.27        1.61      5.31

SMT-COMP’07    AUFLIA (192)                       2.62      1.00        2.64      0.99
               AUFLIRA (193)                      15.1      1.00        15.1      1.00
               QF_AUFBV (187)                     1.00      1.00        1.00      1.00
               QF_AUFLIA (206)                    1.32      1.00        1.05      1.00
               QF_BV (200)                        1.94      1.00        1.97      1.00
               QF_IDL (186)                       1.01      1.00        1.00      0.98
               QF_LIA (186)                       0.33      10.0        0.95      10.0
               QF_LRA (202)                       0.92      1.00        0.82      1.00
               QF_RDL (168)                       0.90      1.00        0.70      1.00
               QF_UF (199)                        2.17      1.98        2.29      1.98
               QF_UFIDL (201)                     0.85      0.98        0.85      0.98
               QF_UFLIA (110)                     0.25      1.00        0.25      1.00


3.9.9  Experiments  with  restart  schedules 

In  our  experiments  so  far,  we  have  only  considered  deterministic  heuristics.  In  this  section 
we  consider  randomized  heuristics.  Specifically,  we  consider  the  problem  of  constructing 
a  single  restart  schedule  to  use  to  solve  a  set  of  problem  instances  via  a  single  Las  Vegas 
algorithm. 

Following  Gagliolo  &  Schmidhuber  [31],  we  evaluate  our  algorithm  for  constructing 
restart  schedules  using  the  SAT  solver  satz-rand.  We  note  that  satz-rand  is  at  this 
point a relatively old SAT solver. However, it has the following key feature: successive
runs of satz-rand on the same problem instance are independent, as required by our
theoretical results. More modern solvers (e.g., MiniSat) also make use of restarts but
maintain a repository of conflict clauses that is shared among successive runs on the same
instance, violating this independence assumption.

To generate a set of benchmark formulae, we used the instance generator supplied with
blackbox [46] to generate 80 random logistics planning problems, using the same parameters
that were used to generate the instance logistics.d from the paper by Gomes
et al. [35].⁹ We then used SATPLAN to find an optimal plan for each instance, and saved
the Boolean formulae it generated. This yielded a total of 242 Boolean formulae.¹⁰ We
then performed B = 1000 runs of satz-rand on each formula, where the ith run was
performed with a time limit of B/i, as per the discussion in §3.5.2.

We  evaluated  several  different  restart  schedules: 

1.  the  schedule  returned  by  the  offline  greedy  approximation  algorithm, 

2. uniform schedules of the form (t, t, t, ...) for each t ∈ {1, 2, ..., B},

3. geometric restart schedules of the form (β^0, β^1, β^2, ...) for each
β ∈ {1.1^k : 1 ≤ k ≤ ⌈log_1.1 B⌉}, and

4.  Luby’s  universal  restart  schedule. 

We estimated the expected CPU time required by each schedule using the “refined
estimation procedure” described in §3.5.2.
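The three baseline families in the list above (uniform, geometric, and Luby's universal schedule) can be generated lazily. The following is a minimal Python sketch for illustration only; it is not the code used in the experiments, and the function names are hypothetical. The Luby sequence is computed with the standard recursive definition (1, 1, 2, 1, 1, 2, 4, ...).

```python
from itertools import islice


def uniform_schedule(t):
    """Yield the uniform schedule (t, t, t, ...)."""
    while True:
        yield t


def geometric_schedule(beta):
    """Yield the geometric schedule (beta**0, beta**1, beta**2, ...)."""
    run_length = 1.0
    while True:
        yield run_length
        run_length *= beta


def luby_term(i):
    """Return the i-th term (1-indexed) of Luby's universal sequence."""
    k = 1
    while (1 << k) - 1 < i:
        k += 1
    if i == (1 << k) - 1:
        return 1 << (k - 1)  # i = 2^k - 1: the term is 2^(k-1)
    return luby_term(i - (1 << (k - 1)) + 1)  # otherwise recurse into the prefix


def luby_schedule():
    """Yield Luby's universal schedule (1, 1, 2, 1, 1, 2, 4, ...)."""
    i = 1
    while True:
        yield luby_term(i)
        i += 1


def prefix(schedule, n):
    """Return the first n run lengths of a (possibly infinite) schedule."""
    return list(islice(schedule, n))
```

For example, `prefix(luby_schedule(), 7)` gives the first seven run lengths of Luby's schedule, and `prefix(uniform_schedule(85), 3)` gives three runs with threshold 85.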

Table  3.10  gives  the  average  CPU  time  required  by  each  of  the  schedules  we  evaluated. 
Because  run  lengths  were  capped  at  1000  seconds,  the  values  in  this  table  are  lower  bounds 
on  the  (estimated)  expected  running  time  of  a  schedule  on  problem  instances  drawn  from 
the  distribution  used  in  these  experiments.  In  terms  of  these  lower  bounds,  the  schedule 
returned  by  the  greedy  approximation  algorithm  had  the  smallest  mean  running  time.  The 
greedy  schedule  was  1.7  times  faster  than  Luby’s  universal  schedule,  1.5  times  faster  than 
the  best  uniform  schedule  (which  used  threshold  t  =  85),  and  1.1  times  faster  than  the 
best geometric schedule (which set β ≈ 1.6). The average CPU time for a schedule that
performed  no  restarts  was  about  3.4  times  that  of  the  greedy  schedule  in  terms  of  the  lower 
bounds  on  average  CPU  time,  but  is  likely  to  be  much  worse  in  terms  of  actual  expected 
running  time  (it  is  likely  that  some  runs  would  take  much  longer  than  1000  seconds  if  all 
the  runs  were  allowed  to  finish). 

⁹ The parameters are: 9 packages, 5 cities, 2 planes, 3 locations per city, 1 truck per city, and 9 goals.

¹⁰ The number of generated formulae is less than the sum of the minimum plan lengths, because SATPLAN
can trivially reject some plan lengths without invoking a SAT solver.

Table 3.10: Performance of various restart schedules for running satz-rand on a set of
Boolean formulae derived from random logistics planning benchmarks.

Restart schedule               Avg. CPU (s)

Greedy schedule                        21.9
Greedy schedule (CV)                   22.8
Best geometric schedule                23.9
Best uniform schedule                  33.9
Luby’s universal schedule              37.2
No restarts                            74.1

Examining  Table  3.10,  one  may  be  concerned  that  the  greedy  approximation  algorithm 
was  run  using  the  same  estimated  run  length  distributions  that  were  later  used  to  estimate 
its  expected  CPU  time.  To  address  the  possibility  of  overfitting,  we  also  evaluated  the 
greedy  algorithm  using  leave-one-out  cross-validation.  The  estimated  average  CPU  time 
increased  by  about  4%  under  leave-one-out  cross-validation. 

3.9.10  Ways  to  improve  our  experimental  results 

The  results  of  the  experiments  presented  in  this  chapter  could  potentially  be  improved  in 
at  least  two  ways: 

1.  Sharing  information  among  heuristics.  In  the  experiments  performed  in  this  chapter, 
each  heuristic  executes  independently.  In  practice,  during  the  process  of  solving  an 
instance,  one  heuristic  may  discover  information  that  could  be  useful  to  share  with 
other  heuristics.  For  example,  when  solving  optimization  problems,  the  heuristics 
could  share  upper  and  lower  bounds  on  the  optimal  objective  function  value.  When 
solving  decision  problems  such  as  Boolean  satisfiability  or  constraint  satisfaction, 
the  runs  could  maintain  a  common  repository  of  learned  conflict  clauses. 

2.  Monitoring  progress  of  heuristics.  The  schedules  considered  in  this  chapter  simply 
run  a  heuristic  for  a  certain  amount  of  time,  without  monitoring  the  heuristic  to  see 
whether it appears to be close to producing an answer. In practice, it may be
possible to predict a heuristic’s remaining running time based on its current state. For
example, if the heuristic makes use of chronological backtracking, one could
examine how much of the search tree has already been pruned. One could also leverage
existing techniques for deliberation control (e.g., [58, 72]). Exploiting information
of this sort is an interesting prospect, both from an experimental and a theoretical
point of view.


3.10  Conclusions 

This  chapter  presented  algorithms  for  combining  multiple  heuristics  in  offline  and  online 
settings.  Experimentally,  we  used  data  from  recent  solver  competitions  to  show  that,  by 
combining  heuristics,  we  can  improve  the  performance  of  state-of-the-art  solvers  in  several 
problem domains. Our experimental evaluation considered heuristics for optimization
problems and decision problems, as well as a randomized heuristic, and showed that
instance-specific features can be exploited to obtain additional performance improvements.


Chapter 4

Using  Decision  Procedures  Efficiently 
for  Optimization 

4.1  Introduction 


Optimization  problems  are  often  solved  by  making  repeated  calls  to  a  decision  procedure 
that answers questions of the form “Does there exist a solution with cost at most k?” Each
query  to  the  decision  procedure  can  be  represented  as  a  pair  (k,  t),  where  t  is  a  bound  on 
the  CPU  time  the  decision  procedure  may  consume  in  answering  the  question.  The  result 
of  a  query  is  either  a  (provably  correct)  “yes”  or  “no”  answer  or  a  timeout.  A  query  strategy 
is  a  rule  for  determining  the  next  query  (k,  t)  as  a  function  of  the  responses  to  previous 
queries. 

The  performance  of  a  query  strategy  can  be  measured  in  several  ways.  Given  a  fixed 
query strategy and a fixed minimization problem, let u(T) denote the upper bound (i.e., the
smallest k that elicited a “yes” response) obtained by running the query strategy for a total
of T time units, and let l(T) be the corresponding lower bound. A natural goal is for u(T)
to decrease as quickly as possible. Alternatively, we might want to achieve u(T) ≤ α·l(T)
in the minimum possible time for some desired approximation ratio α ≥ 1.

In  this  chapter  we  study  the  problem  of  designing  query  strategies.  Our  goal  is  to 
devise  strategies  that  do  well  with  respect  to  natural  performance  criteria  such  as  the  ones 
just  described,  when  applied  to  decision  procedures  whose  behavior  (i.e.,  how  the  required 
CPU  time  varies  as  a  function  of  k)  is  typical  of  the  procedures  used  in  practice. 

The results in this chapter are based on a conference paper [83].

4.1.1 Motivations


A.I.  planning  is  the  problem  of  finding  a  sequence  of  actions  (called  a  plan)  that  leads  from 
a  starting  state  to  a  desired  goal  state,  according  to  some  formally-specified  model  of  how 
actions  affect  the  state  of  the  world.  The  makespan  of  a  plan  is  the  number  of  steps  in  the 
plan, treating actions that can be performed simultaneously as a single step. In optimal
planning, the goal is to find a plan with (provably) minimum makespan.

The two winners from the optimal track of last year’s International Planning
Competition were SATPLAN [47] and Maxplan [89]. Both planners find a minimum-makespan
plan by making a series of calls to a SAT solver, where each call determines whether there
exists a feasible plan of makespan ≤ k (where the value of k varies across calls). One of
the differences between the two planners is that SATPLAN uses the ramp-up query
strategy (in which the ith query is (i, ∞)), whereas Maxplan uses the ramp-down strategy (in
which the ith query is (U − i, ∞), where U is an upper bound obtained using heuristics).

Figure 4.1: Behavior of the SAT solver siege running on formulae generated by
SATPLAN to solve instance p17 from the pathways domain of the 2006 International
Planning Competition.

To  appreciate  the  importance  of  query  strategies,  consider  Figure  4.1,  which  shows  the 
CPU  time  required  by  siege  (the  SAT  solver  used  by  SATPLAN)  as  a  function  of  the 
makespan bound k, on a benchmark instance from the competition. For most values of
k, the solver terminates in under one minute; for k = 19 and k = 21, the solver requires
10-20 minutes; and for k = 20, the solver was run for over 100 hours without returning
an answer. Because only the queries with k ≥ 21 return a “yes” answer, the ramp-up
query strategy (used by SATPLAN) does not find a feasible plan after running for 100
hours, while the ramp-down strategy returns a feasible plan but does not yield any
nontrivial lower bounds on the optimum makespan. In this example, the time required by
any query strategy to obtain a provably optimal plan is dominated by the time required to
run the decision procedure with input k = 20. On the other hand, executing the queries
(18, ∞) and (23, ∞) takes less than two minutes and yields a plan whose makespan is
provably at most 23/19 ≈ 1.21 times optimal (the “no” answer for k = 18 gives a lower
bound of 19, and the “yes” answer for k = 23 gives an upper bound of 23). Thus, the
choice of query strategy has a
dramatic effect on the time required to obtain a provably approximately optimal solution.
For  planning  problems  where  provably  optimal  plans  are  currently  out  of  reach,  obtaining 
provably  approximately  optimal  plans  quickly  is  a  natural  goal. 


4.1.2  Summary  of  results 

In  this  chapter  we  consider  the  problem  of  devising  query  strategies  in  two  settings.  In  the 
single-instance  setting,  we  are  confronted  with  a  single  optimization  problem,  and  wish 
to  obtain  an  (approximately)  optimal  solution  as  quickly  as  possible.  In  this  setting  we 
provide  a  simple  query  strategy  S2,  and  analyze  its  performance  in  terms  of  a  parameter 
that  is  intended  to  capture  the  unpredictability  of  the  decision  procedure’s  behavior.  We 
then  show  that  our  performance  guarantee  is  optimal  up  to  a  constant  factor. 

In the multiple-instance setting, we use the same decision procedure to solve a number
of optimization problems, and our goal is to learn from experience in order to improve
performance. In this setting, we prove that computing an optimal query strategy is
NP-hard, and discuss how algorithms from machine learning theory can be used to learn a
good query strategy on-the-fly while solving a sequence of optimization problems.

In  the  experimental  section  of  this  chapter,  we  demonstrate  that  query  strategy  S2  can 
be  used  to  create  improved  versions  of  state-of-the-art  algorithms  for  planning  and  job 
shop  scheduling.  In  the  course  of  the  latter  experiments  we  develop  a  simple  method  for 
applying  query  strategies  to  branch  and  bound  algorithms,  which  seems  likely  to  be  useful 
in other domains besides job shop scheduling.

4.1.3 Related work


The  ramp-up  strategy  was  used  in  the  original  GraphPlan  algorithm  [13]  for  A.I.  planning, 
and  is  conceptually  similar  to  iterative  deepening  [52]. 

In  the  A.I.  planning  community,  alternatives  to  the  ramp-up  strategy  were  investigated 
by  Rintanen  [69],  who  proposed  two  algorithms.  Algorithm  A  runs  the  decision  procedure 
on  the  first  n  decision  problems  in  parallel,  each  at  equal  strength,  where  n  is  a  parameter. 
Algorithm  B  runs  the  decision  procedure  on  all  decision  problems  simultaneously,  with  the 
ith problem receiving a fraction of the CPU time proportional to γ^i, where γ ∈ (0, 1) is a
parameter.  Rintanen  showed  that  Algorithm  B  yields  dramatic  performance  improvements 
over  the  ramp-up  strategy  on  a  variety  of  A.I.  planning  benchmarks. 

Our query strategy S2 exploits binary search and is quite different from the three
strategies just discussed. In the experimental section of this chapter, we compare S2 to the
ramp-up strategy and to a geometric strategy based on Rintanen’s Algorithm B.


4.2  Preliminaries 

In  this  chapter  we  are  interested  in  solving  minimization  problems  of  the  form 

\[ \mathrm{OPT} = \min_{x \in X} c(x) \]

where X is an arbitrary set and c : X → Z+ is a function assigning a positive integer cost
to each x ∈ X. We will solve such a minimization problem by making a series of calls to
a decision procedure that, given as input an integer k, determines whether there exists an
x ∈ X with c(x) ≤ k. When given input k, the decision procedure runs for r(k) time units
before returning a (provably correct) “yes” or “no” answer. Thus from our point of view,
a minimization problem is completely specified by the integer OPT and the function r.

Definition (instance). An instance of a minimization problem is a pair (OPT, r), where
OPT  is  the  smallest  input  for  which  the  decision  procedure  answers  “yes”  and  r(k)  is  the 
CPU  time  required  by  the  decision  procedure  when  it  is  run  with  input  k. 

A query is a pair (k, t). To execute this query, one runs the decision procedure with
input k subject to a time limit t. Executing query q = (k, t) on instance I = (OPT, r)
requires CPU time min{t, r(k)} and elicits the response

\[
\mathrm{response}(I, q) =
\begin{cases}
\text{yes} & \text{if } t \ge r(k) \text{ and } k \ge \mathrm{OPT} \\
\text{no} & \text{if } t \ge r(k) \text{ and } k < \mathrm{OPT} \\
\text{timeout} & \text{if } t < r(k).
\end{cases}
\]

We say that a query q eliminates an integer k if executing q determines which side of
OPT k lies on.

Definition (elimination). A query q = (k_0, t) eliminates a value k if the response to q is
“yes” and k ≥ k_0, or if the response is “no” and k ≤ k_0.

Definition (query strategy). A query strategy S is a function that takes as input the
sequence (r_1, r_2, ..., r_i) of responses to the first i queries, and returns as output a new query
(k, t).

When executing queries according to some query strategy, we maintain upper and
lower bounds on OPT. Initially l = 1 and u = ∞. If query (k, t) elicits a “no” response
we set l ← k + 1; and if it elicits a “yes” response we set u ← k. Thus, any k ∉ [l, u − 1]
has been eliminated by the query strategy.
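The definitions of query execution and bound maintenance above can be made concrete with a small simulator. This is a minimal Python sketch under stated assumptions: an instance is passed as an integer `opt` and a callable `r`, and the names `execute_query` and `update_bounds` are hypothetical helpers introduced only for illustration.

```python
def execute_query(opt, r, k, t):
    """Execute query (k, t) on instance (opt, r).

    Returns (response, cpu_time), where cpu_time = min(t, r(k)).
    """
    cpu_time = min(t, r(k))
    if t < r(k):
        return "timeout", cpu_time
    return ("yes" if k >= opt else "no"), cpu_time


def update_bounds(lower, upper, k, response):
    """Update the bounds (l, u) on OPT after a response to a query on k.

    The max/min guards are an implementation choice; they keep the bounds
    monotone even if an already-eliminated k is queried again.
    """
    if response == "no":
        lower = max(lower, k + 1)
    elif response == "yes":
        upper = min(upper, k)
    return lower, upper
```

A timeout leaves both bounds unchanged, which is why a query strategy must decide separately how to record and react to timeouts.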


4.2.1  Performance  of  query  strategies 


In  the  single-instance  setting,  we  will  evaluate  a  query  strategy  according  to  the  following 
competitive  ratio. 

Definition (competitive ratio). The competitive ratio of a query strategy S on an instance
I = (OPT, r) is defined by

\[ \mathrm{ratio}(S, I) = \max_{k} \frac{T_{\mathrm{elim}}(S, k)}{r(k)} \]

where T_elim(S, k) is the CPU time required to eliminate k when executing queries according
to strategy S.


As an example, consider running the ramp-up query strategy on the instance I =
(OPT, r), where $r(k) = 2^{k-1}$ for all k. The ramp-up strategy must be run for CPU time
$1 + 2 + 4 + 8 + \cdots + 2^{\mathrm{OPT}-1} = 2^{\mathrm{OPT}} - 1$ in order to eliminate the value OPT, and
OPT is the last k value to be eliminated. Thus the ramp-up strategy has competitive ratio
$(2^{\mathrm{OPT}} - 1) / 2^{\mathrm{OPT}-1} < 2$ on the instance I.
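The arithmetic in this example can be checked numerically. The sketch below is illustrative Python with hypothetical function names. It simulates the ramp-up strategy, records the elimination time of every k up to an assumed known upper limit `upper`, and evaluates the competitive ratio with exact rational arithmetic.

```python
from fractions import Fraction


def ramp_up_elimination_times(opt, r, upper):
    """Elimination time of each k in 1..upper under ramp-up queries (k, inf)."""
    elapsed = 0
    elim_time = {}
    for k in range(1, opt + 1):
        elapsed += r(k)  # no time limit, so the query always completes
        if k < opt:
            elim_time[k] = elapsed  # "no": k is now known to lie below OPT
        else:
            for j in range(opt, upper + 1):
                elim_time[j] = elapsed  # "yes": every j >= OPT is eliminated
    return elim_time


def competitive_ratio(elim_time, r):
    """max over k of (time to eliminate k) / r(k), as an exact fraction."""
    return max(Fraction(t, r(k)) for k, t in elim_time.items())
```

With r(k) = 2^(k−1), OPT = 6, and upper = 10, the value OPT is eliminated after 63 time units and the ratio is 63/32, which is below 2 as claimed.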


4.2.2  Behavior  of  r 

The  performance  of  our  query  strategies  will  depend  on  the  behavior  of  the  function  r.  For 
most decision procedures used in practice, we expect r(k) to be an increasing function
for k < OPT and a decreasing function for k > OPT. Previous work [75, 85] has
shown  that  this  behavior  is  prevalent  in  planning  domains  (e.g.,  see  the  behavior  of  siege 
illustrated  in  Figure  4.1),  and  our  query  strategies  are  designed  to  take  advantage  of  it. 
More  specifically,  our  query  strategies  are  designed  to  work  well  when  r  is  close  to  its 
hull. 

Definition (hull). The hull of r is the function

\[ \mathrm{hull}_r(k) = \min\left\{ \max_{k_0 \le k} r(k_0),\ \max_{k_1 \ge k} r(k_1) \right\}. \]

Figure 4.2 gives an example of a function r (gray bars) and its hull (dots). Note that the
region under the curve hull_r(k) is not (in general) the convex hull of the points (k, r(k)).
Also note that the functions r and hull_r are identical if r is monotonically increasing (or
monotonically decreasing), or if there exists an x such that r is monotonically increasing
for k ≤ x and monotonically decreasing for k ≥ x.




Figure  4.2:  A  function  r  (gray  bars)  and  its  hull  (dots). 

We measure the discrepancy between r and its hull in terms of the stretch of an
instance.

Definition (stretch). The stretch of an instance I = (OPT, r) is defined by

\[ \mathrm{stretch}(I) = \max_{k} \frac{\mathrm{hull}_r(k)}{r(k)}. \]

The instance depicted in Figure 4.2 has a stretch of 2 because r(2) = 1 while hull_r(2) = 2.
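Both definitions can be computed in linear time with running prefix and suffix maxima. The following Python sketch is illustrative only; the list `r_values` (with `r_values[j]` holding r(j + 1)) and the example data in the note below are hypothetical and are not the values plotted in Figure 4.2.

```python
def hull(r_values):
    """Return [hull_r(1), ..., hull_r(n)] for r_values = [r(1), ..., r(n)]."""
    n = len(r_values)
    prefix_max = [0] * n  # prefix_max[j] = max of r over k0 <= j + 1
    suffix_max = [0] * n  # suffix_max[j] = max of r over k1 >= j + 1
    best = float("-inf")
    for j in range(n):
        best = max(best, r_values[j])
        prefix_max[j] = best
    best = float("-inf")
    for j in reversed(range(n)):
        best = max(best, r_values[j])
        suffix_max[j] = best
    return [min(p, s) for p, s in zip(prefix_max, suffix_max)]


def stretch(r_values):
    """Return max over k of hull_r(k) / r(k)."""
    return max(h / v for h, v in zip(hull(r_values), r_values))
```

For the hypothetical values r = (2, 1, 3, 4, 2), the hull is (2, 2, 3, 4, 2) and the stretch is 2, because r(2) = 1 sits in a dip between larger values on both sides. For a unimodal r such as (1, 2, 4, 3, 1), the hull coincides with r and the stretch is 1.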
4.3 The Single-Instance Setting


We  first  consider  the  case  in  which  we  wish  to  design  a  query  strategy  for  use  in  solving 
a single instance I = (OPT, r) (where OPT and r are of course unknown to us). Our
goal is to devise a query strategy that minimizes the value of ratio(S, I) for the worst
case instance I. We assume OPT ∈ {1, 2, ..., U} for some known upper bound U. For
simplicity, we also assume r(k) ≥ 1 for all k.

4.3.1  Arbitrary  instances 

In the case where the function r is arbitrary, the following simple query strategy S1
achieves a competitive ratio that is optimal (to within constant factors). S1 and its analysis
are similar to those of Algorithm A of Rintanen [69]. We do not advocate the use of S1.
Rather, its analysis indicates the limits imposed by making no assumptions about r.

Query strategy S1

1. Initialize T ← 1, l ← 1, and u ← U.

2. While l < u:

(a) For each k ∈ {l, l + 1, ..., u − 1}, execute the query (k, T),
and update l and u appropriately (if the response is “yes” then
set u ← k, and if the response is “no” then set l ← k + 1).

(b) Set T ← 2T.

The analysis of S1 is straightforward. Consider some fixed k, and let $T_k = 2^{\lceil \log_2 r(k) \rceil}$
be the smallest power of two that is ≥ r(k). Each iteration of the loop consumes time at
most TU, and on the iteration where T = T_k, k will be eliminated. Thus the total time it
takes to eliminate k is at most

\[ U + 2U + 4U + \cdots + T_k U \le 2 T_k U \le 4 r(k) U . \]

Because k was arbitrary, it follows that ratio(S1, I) ≤ 4U.
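Strategy S1 and the bound just derived can be exercised on a small simulated instance. This is a minimal Python sketch, assuming the instance is given as an integer `opt` and a callable `r` (both unknown to a real strategy, and used here only to simulate responses). Skipping values eliminated earlier in the same pass is an implementation choice that can only reduce the time consumed.

```python
def strategy_s1(opt, r, upper):
    """Simulate S1 on instance (opt, r) with OPT in {1, ..., upper}.

    Returns (value found for OPT, total CPU time consumed).
    """
    T, l, u = 1, 1, upper
    total = 0
    while l < u:
        for k in range(l, u):  # snapshot of {l, ..., u - 1} for this pass
            if not (l <= k < u):
                continue  # already eliminated earlier in this pass
            total += min(T, r(k))
            if T < r(k):
                continue  # timeout: bounds are unchanged
            if k >= opt:
                u = k  # "yes"
            else:
                l = k + 1  # "no"
        T *= 2
    return l, total
```

On the hypothetical instance with U = 8, OPT = 5, and r(k) = 2^(4 − |k − 5|), the simulation finds OPT after 79 time units, well within the worst-case bound of 4U · max r(k) = 512.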

To obtain a matching lower bound, suppose that r(k) = 1 if k = k*, and r(k) = ∞
otherwise. For any query strategy S, there is some choice of k* that forces S to consume
time at least U before executing a successful query,¹ which implies ratio(S, I) ≥ U.

¹ We are only considering deterministic query strategies. For randomized query strategies, there must be
some choice of k* that forces S to consume expected time at least U/2 before executing a successful query.

These observations are summarized in the following theorem.

Theorem 25. For any instance I, ratio(S1, I) = O(U). Furthermore, for any strategy S,
there exists an instance I such that ratio(S, I) = Ω(U).

4.3.2  Instances  with  low  stretch 

In  practice  we  do  not  expect  r  to  be  as  pathological  as  the  function  used  to  prove  the  lower 
bound  in  Theorem  25.  Indeed,  as  already  mentioned,  in  practice  we  expect  instances  to 
have  low  stretch,  whereas  the  instance  used  to  prove  the  lower  bound  has  infinite  stretch. 
We now describe a query strategy S2 whose competitive ratio is O(stretch(I) · log U), a
dramatic improvement over Theorem 25 for instances with low stretch.

Like S1, strategy S2 maintains an interval [l, u] that is guaranteed to contain OPT, and
maintains a value T that is periodically doubled. S2 also maintains a “timeout interval”
[t_l, t_u] with the property that the queries (t_l, T) and (t_u, T) have both been executed and
returned a timeout response.

Query strategy S2

1. Initialize T ← 2, l ← 1, u ← U, t_l ← ∞, and t_u ← −∞.

2. While l < u:

(a) If t_l ≠ ∞ and [l, u − 1] ⊆ [t_l, t_u] then set T ← 2T, set t_l ← ∞,
and set t_u ← −∞.

(b) Let u′ = u − 1. Define

\[
k =
\begin{cases}
\lfloor (l + u')/2 \rfloor & \text{if } [l, u'] \text{ and } [t_l, t_u] \text{ are disjoint or } t_l = \infty \\
\lfloor (l + t_l - 1)/2 \rfloor & \text{if } [l, u'] \text{ and } [t_l, t_u] \text{ intersect and } t_l - l > u' - t_u \\
\lfloor (t_u + 1 + u')/2 \rfloor & \text{otherwise.}
\end{cases}
\]

(c) Execute the query (k, T). If the result is “yes” set u ← k; if
the result is “no” set l ← k + 1; and if the result is “timeout”
set t_l ← min{t_l, k} and set t_u ← max{t_u, k}.

Each query executed by S2 is of the form (k, T), where k ∈ [l, u − 1] but k ∉ [t_l, t_u]. We
say that such a k value is eligible. The queries are selected in such a way that the number
of eligible k values decreases exponentially. This is accomplished using what could be
described as a “two-sided” binary search. Once there are no more eligible k values, T is
doubled and [t_l, t_u] is reset to the empty interval (so each k ∈ [l, u − 1] becomes eligible
again).
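The two-sided binary search can be sketched as a short simulation. This is illustrative Python under stated assumptions: the midpoint rules follow our reading of step 2(b) of the pseudocode for S2, the instance is supplied as `opt` and a callable `r` only so that responses can be simulated, and the variable names `t_lo` and `t_hi` stand for t_l and t_u.

```python
import math


def strategy_s2(opt, r, upper):
    """Simulate S2 on instance (opt, r) with OPT in {1, ..., upper}.

    Returns (value found for OPT, total CPU time consumed).
    """
    T, l, u = 2, 1, upper
    t_lo, t_hi = math.inf, -math.inf  # empty timeout interval
    total = 0
    while l < u:
        # (a) No eligible k remains: double T and reset the timeout interval.
        if t_lo != math.inf and t_lo <= l and u - 1 <= t_hi:
            T *= 2
            t_lo, t_hi = math.inf, -math.inf
        # (b) Choose the midpoint of the larger eligible side.
        u1 = u - 1
        if t_lo == math.inf or t_hi < l or u1 < t_lo:
            k = (l + u1) // 2
        elif t_lo - l > u1 - t_hi:
            k = (l + t_lo - 1) // 2
        else:
            k = (t_hi + 1 + u1) // 2
        # (c) Execute the query (k, T) and update the state.
        total += min(T, r(k))
        if T < r(k):
            t_lo, t_hi = min(t_lo, k), max(t_hi, k)  # timeout
        elif k >= opt:
            u = k  # "yes"
        else:
            l = k + 1  # "no"
    return l, total
```

On the hypothetical instance with U = 8, OPT = 5, and r(k) = 2^(4 − |k − 5|), this simulation terminates after 70 time units, compared with 79 for the S1 sketch on the same instance.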

To analyze S2, we first bound the number of queries that can be executed in between
updates to T. As already mentioned, the k value defined in step 2(b) belongs to the interval
[l, u − 1] but not to the interval [t_l, t_u]. By examining each case, we find that the number of
k values that have this property goes down by at least a constant factor with every query,
except for the very first query that causes a timeout. It follows that the number of queries
in between updates to T is O(log U).

To complete the analysis, first note that whenever t_l ≠ ∞ and t_u ≠ −∞, it holds that
r(t_l) > T and r(t_u) > T. For any k ∈ [t_l, t_u], this implies hull_r(k) > T (by definition of
hull) and thus r(k) > T / stretch(I) (by definition of stretch). Now consider some arbitrary k.
Once T ≥ stretch(I) · r(k) it cannot be that k ∈ [t_l, t_u], so we must have k ∉ [l, u − 1]
before T can be doubled again. Because there can be at most O(log U) queries in between
updates to T, it follows that we have to wait O(stretch(I) · r(k) · log U) time before k ∉
[l, u − 1]. Because this holds for all k, it follows that ratio(S2, I) = O(stretch(I) · log U).

We now use a simple information-theoretic argument to prove a matching lower bound.
Fix some query strategy S. Let r(k) = 1 for all k (clearly, stretch(I) = 1). Assume
without loss of generality that S only executes queries of the form (k, 1). For each OPT ∈
{1, 2, ..., U}, S must elicit a unique sequence of “yes” or “no” answers, one of which
must have length ≥ ⌊log_2 U⌋. Thus for some choice of OPT, ratio(S, I) ≥ ⌊log_2 U⌋ =
Ω(stretch(I) · log U). Thus we have proved the following theorem.

Theorem 26. For any instance I, ratio(S2, I) = O(stretch(I) · log U). For any strategy
S, there exists an instance I such that ratio(S, I) = Ω(stretch(I) · log U).

4.3.3 Generalizing S2

Although  the  performance  of  query  strategy  S2  (as  summarized  in  Theorem  26)  is  optimal 
to  within  constant  factors,  in  practice  one  might  want  to  adjust  the  behavior  of  S2  so  as  to 
obtain  better  performance  on  a  particular  set  of  optimization  problems.  Toward  this  end, 
we generalize S2 by introducing three parameters: β controls the value of k; γ controls
the rate at which the time limit T is increased; and ρ controls the balance between the time
the strategy spends working to improve its lower bound versus the time it spends working
to improve the upper bound. Each parameter takes on a value between 0 and 1. The
parameters were chosen so as to include several natural query strategies in the parameter
space. The original strategy S2 is recovered by setting β = γ = ρ = 1/2. When β = γ = 0
and ρ = 0, S3 is equivalent to the ramp-up query strategy (in which the ith query is (i, ∞)).
When β = γ = 0 and ρ = 1, S3 is equivalent to the ramp-down query strategy (in which
the ith query is (U − i, ∞)).

The  analysis  of  S3  follows  along  exactly  the  same  lines  as  that  of  S2.  Retracing  the 
argument  leading  up  to  Theorem  26  and  working  out  the  appropriate  constant  factors  yields 
the following theorem, which shows that the class S3(β, γ, ρ) includes a wide variety of
query strategies with performance guarantees similar to that of S2 (note that the theorem
provides no guarantees when β = 0 or γ = 0, as in the ramp-up and ramp-down strategies).


Theorem 27. Let S = S3(β, γ, ρ), where 0 < β ≤ 1/2, 0 < γ < 1, and 0 < ρ < 1. Then
for  any  instance  I,  ratio(S,  I)  =  0(J^  ■  stretch(J)  •  log  U ). 


Query strategy S3(β, γ, ρ)

1. Initialize T ← 1/γ, l ← 1, u ← U, t_l ← ∞, and t_u ← −∞.

2. While l < u:

(a) If t_l ≠ ∞ and [l, u − 1] ⊆ [t_l, t_u] then set T ← T/γ, set t_l ← ∞,
and set t_u ← −∞.

(b) Let u′ = u − 1. If [l, u′] and [t_l, t_u] are disjoint (or t_l = ∞)
then define

\[
k =
\begin{cases}
\lfloor (1-\beta)\, l + \beta u' \rfloor & \text{if } (1-\rho)\, l > \rho\, (U - u') \\
\lfloor \beta l + (1-\beta)\, u' \rfloor & \text{otherwise;}
\end{cases}
\]

else define

\[
k =
\begin{cases}
\lfloor (1-\beta)\, l + \beta (t_l - 1) \rfloor & \text{if } (1-\rho)(t_l - l) > \rho\, (u' - t_u) \\
\lfloor (1-\beta)\, u' + \beta (t_u + 1) \rfloor & \text{otherwise.}
\end{cases}
\]

(c) Execute the query (k, T). If the result is “yes” set u ← k; if
the result is “no” set l ← k + 1; and if the result is “timeout”
set t_l ← min{t_l, k} and set t_u ← max{t_u, k}.
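The parameterized strategy can be simulated in the same way as S2. The sketch below is illustrative Python and rests on two assumptions that should be checked against the pseudocode: the time limit starts at 1/γ and is divided by γ at each update, and γ = 0 is treated as an infinite time limit (which matches the description of the ramp-up and ramp-down special cases). Parameter values outside the range of Theorem 27, other than those two limiting cases, are not guarded against.

```python
import math


def strategy_s3(opt, r, upper, beta, gamma, rho):
    """Simulate S3(beta, gamma, rho); returns (value found for OPT, CPU time)."""
    T = math.inf if gamma == 0 else 1 / gamma  # assumption: gamma = 0 means no limit
    l, u = 1, upper
    t_lo, t_hi = math.inf, -math.inf  # empty timeout interval
    total = 0
    while l < u:
        # (a) No eligible k remains: raise the time limit and reset.
        if t_lo != math.inf and t_lo <= l and u - 1 <= t_hi:
            T = T / gamma
            t_lo, t_hi = math.inf, -math.inf
        # (b) Choose k; rho selects the side, beta the position within it.
        u1 = u - 1
        if t_lo == math.inf or t_hi < l or u1 < t_lo:
            if (1 - rho) * l > rho * (upper - u1):
                k = math.floor((1 - beta) * l + beta * u1)
            else:
                k = math.floor(beta * l + (1 - beta) * u1)
        elif (1 - rho) * (t_lo - l) > rho * (u1 - t_hi):
            k = math.floor((1 - beta) * l + beta * (t_lo - 1))
        else:
            k = math.floor((1 - beta) * u1 + beta * (t_hi + 1))
        # (c) Execute the query (k, T) and update the state.
        total += min(T, r(k))
        if T < r(k):
            t_lo, t_hi = min(t_lo, k), max(t_hi, k)  # timeout
        elif k >= opt:
            u = k  # "yes"
        else:
            l = k + 1  # "no"
    return l, total
```

On the hypothetical instance with U = 8, OPT = 5, and r(k) = 2^(4 − |k − 5|), the setting β = γ = ρ = 1/2 reproduces the behavior of the S2 sketch (70 time units), the setting (0, 0, 0) issues the ramp-up queries 1, 2, ..., 5 (31 time units), and the setting (0, 0, 1) issues the ramp-down queries 7, 6, 5, 4 (36 time units).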




4.4  The  Multiple-Instance  Setting 


We now consider the case in which the same decision procedure is used to solve a sequence
(x_1, x_2, ..., x_n) of instances of some optimization problem. In this case, it is natural to
attempt to learn something about the instance sequence and select query strategies
accordingly.

Let 𝒮 be some set of query strategies, and for any S ∈ 𝒮, let c_i(S) denote the CPU time
required to obtain an acceptable solution to instance x_i = (OPT_i, r_i) using query strategy
S (e.g., c_i(S) could be the time required to obtain a solution whose cost is provably at
most a factor α times optimal, for some α ≥ 1). We consider the problem of selecting
query strategies in two settings: offline and online.


4.4.1  Computing  an  optimal  query  strategy  offline 

In the offline setting we are given as input the values of $\tau_i(k)$ for all $i$ and $k$, and wish to compute the query strategy

$$S^* = \arg\min_{S \in \mathcal{S}} \sum_{i=1}^{n} c_i(S).$$

This offline optimization problem arises in practice when the instances $(x_1, x_2, \ldots, x_n)$ have been collected for use as training data, and we wish to compute the strategy $S^*$ that performs optimally on the training data.
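When $\mathcal{S}$ is small, the arg min above can be evaluated directly. The following is a minimal sketch, assuming the costs $c_i(S)$ have already been measured and stored per strategy; the strategy names and cost values are hypothetical and serve only as an illustration.

```python
from typing import Dict, List


def best_offline_strategy(costs: Dict[str, List[float]]) -> str:
    """Return the strategy whose total cost over the training instances is smallest.

    costs[name][i] plays the role of c_i(S) for the strategy called `name`.
    """
    return min(costs, key=lambda name: sum(costs[name]))


# Hypothetical measured CPU times for three strategies on three instances.
training_costs = {
    "ramp_up": [4.0, 9.0, 2.0],
    "S2": [3.0, 5.0, 4.0],
    "Sg": [5.0, 6.0, 3.0],
}
best = best_offline_strategy(training_costs)
```

Here the totals are 15, 12 and 14, so the sketch selects the strategy labelled `"S2"`.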

Unfortunately, if $\mathcal{S}$ contains all possible query strategies then computing $S^*$ is NP-hard. To see this, suppose that our goal is to obtain an approximation ratio $\alpha = U - 1$. To obtain this ratio, we simply need to execute a single query that returns a non-timeout response. Consider the special case that $\tau_i(k) \in \{1, \infty\}$ for all $i$ and $k$, and without loss of generality consider only query strategies that issue queries of the form $(k, 1)$. For our purposes, such a query strategy is just a permutation of the $k$ values in the set $\{1, 2, \ldots, U\}$. For each $k$, let $A_k = \{x_i : \tau_i(k) = 1\}$. To find an optimal query strategy, we must order the sets $A_1, A_2, \ldots, A_U$ from left to right so as to minimize the sum, over all instances $x_i$, of the position of the leftmost set that contains $x_i$. This is exactly Min-Sum Set Cover. For any $\epsilon > 0$, obtaining a $4 - \epsilon$ approximation to Min-Sum Set Cover is NP-hard [26]. Thus we have the following theorem.

Theorem 28. For any $\epsilon > 0$, obtaining a $4 - \epsilon$ approximation to the optimal query strategy is NP-hard.
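The objective used in the reduction can be made concrete with a small example. The sketch below assumes a hypothetical table `tau` with entries in $\{1, \infty\}$ (indexed from 0 rather than 1) and evaluates every ordering by brute force, which is only feasible for very small $U$.

```python
import math
from itertools import permutations


def ordering_cost(tau, order):
    """Sum, over instances, of the 1-based position of the first query k in
    `order` that returns a non-timeout response (tau[i][k] == 1)."""
    total = 0
    for row in tau:
        for position, k in enumerate(order, start=1):
            if row[k] == 1:
                total += position
                break
        else:
            raise ValueError("some instance is answered by no query")
    return total


INF = math.inf
# Hypothetical data: tau[i][k] is the time query k needs on instance i.
tau = [
    [1, INF, INF],
    [1, INF, 1],
    [INF, 1, 1],
    [INF, INF, 1],
]
# Brute force over all U! orderings (here U = 3).
best_order = min(permutations(range(3)), key=lambda order: ordering_cost(tau, order))
```

For this table the best ordering issues query 2 first, then 0, then 1, for a total cost of 5.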
Certain special cases of the offline problem are tractable. For example, suppose all queries take the same time, say $\tau_i(k) = t$ for all $i$ and $k$. In this case we need only consider queries of the form $(k, t)$, and any such query elicits a non-timeout response. A query strategy can then be specified as a binary search tree over the key set $\{1, 2, \ldots, U\}$. The optimal query strategy is simply the optimum binary search tree for the access sequence $(OPT_1, OPT_2, \ldots, OPT_n)$, which can be computed in $O(U^2)$ time using dynamic programming [51]. Similarly, if we consider arbitrary $\tau_i$, but restrict ourselves to queries of the form $(k, \infty)$ (so that again all queries succeed), dynamic programming can be used to compute an optimal query strategy. Finally, the offline problem is tractable if $\mathcal{S}$ is small enough for us to search through it by brute force. Based on the results of the previous section, a natural choice would be for $\mathcal{S}$ to include $S_3(\beta, \gamma, \rho)$ for various values of the three parameters.

As discussed in Chapter 2, a simple greedy algorithm achieves the optimal approximation ratio of 4 for Min-Sum Set Cover, and it is natural to wonder whether this greedy algorithm can be generalized to obtain a 4-approximation to the optimal query strategy. We are not aware of any straightforward way of accomplishing this. Note that in general, the optimal query strategy will adapt its choice of later queries based on the results of earlier queries, whereas the natural generalization of the greedy algorithm for Min-Sum Set Cover would produce a static list of queries.
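For reference, the greedy rule for Min-Sum Set Cover itself is easy to state in code. This is a minimal sketch of that rule only (it is not a query-strategy algorithm); the set contents are hypothetical, and ties are broken in favor of the lowest index.

```python
def greedy_min_sum_set_cover(sets):
    """Order set indices greedily: at each step take the set that covers the
    largest number of still-uncovered elements."""
    uncovered = set().union(*sets)
    remaining = list(range(len(sets)))
    order = []
    while uncovered:
        k = max(remaining, key=lambda j: len(sets[j] & uncovered))
        order.append(k)
        remaining.remove(k)
        uncovered -= sets[k]
    return order + remaining  # sets that cover nothing new go last


def min_sum_cost(sets, order):
    """Sum, over elements, of the 1-based position of the first set containing them."""
    cost, uncovered = 0, set().union(*sets)
    for position, k in enumerate(order, start=1):
        newly_covered = sets[k] & uncovered
        cost += position * len(newly_covered)
        uncovered -= newly_covered
    return cost


# Hypothetical sets A_k of instances answered by query k.
sets = [{"x1", "x2"}, {"x3"}, {"x2", "x3", "x4"}]
order = greedy_min_sum_set_cover(sets)
```

The output is a static list, which illustrates the limitation noted above: the ordering cannot react to the outcomes of earlier queries.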


4.4.2  Selecting  query  strategies  online 


We now consider the problem of selecting query strategies in an online setting, assuming that $|\mathcal{S}|$ is small enough that we would not mind using $O(|\mathcal{S}|)$ time or space for decision-making. In the online setting we are fed, one at a time, a sequence $(x_1, x_2, \ldots, x_n)$ of problem instances to solve. Prior to receiving instance $x_i$, we must select a query strategy $S_i \in \mathcal{S}$. We then use $S_i$ to solve $x_i$ and incur cost $c_i(S_i)$. Our regret at the end of $n$ rounds is equal to

$$\frac{1}{n}\left(\mathbb{E}\left[\sum_{i=1}^{n} c_i(S_i)\right] - \min_{S \in \mathcal{S}} \sum_{i=1}^{n} c_i(S)\right) \qquad (4.1)$$

where the expectation is over any random bits used by our strategy-selection algorithm. That is, regret is $\frac{1}{n}$ times the difference between the expected total cost incurred by our online algorithm and that of the optimal query strategy for the (unknown) set of $n$ instances. An online algorithm's worst-case regret is the maximum value of (4.1) over all instance sequences of length $n$. A no-regret algorithm has worst-case regret that is $o(1)$ as a function of $n$.
We now describe how two existing algorithms can be applied to the problem of selecting query strategies. Let $M$ be an upper bound on $c_i(S)$, and let $T$ be an upper bound on $\tau_i(k)$. Viewing our online problem as an instance of the “nonstochastic multiarmed bandit problem” and using the Exp3 algorithm of Auer et al. [5] yields regret

$O\left(M\sqrt{\frac{1}{n}\,|\mathcal{S}|\ln|\mathcal{S}|}\right) = o(1)$. The second algorithm makes use of the fact that on any particular instance $x_i$, we can obtain enough information to determine the value of $c_i(S)$ for all $S \in \mathcal{S}$ by executing the query $(k, T)$ for each $k \in \{1, 2, \ldots, U\}$. This requires CPU time at most $TU$. We can then use the “label-efficient forecaster” of Cesa-Bianchi et al. [17] to select query strategies. Theorem 1 of that paper shows that the regret is at most $\frac{1}{n}\left[M\left(\frac{\ln|\mathcal{S}|}{\eta} + \frac{n\eta}{2\varepsilon}\right) + \varepsilon n T U\right]$, where $\eta$ and $\varepsilon$ are parameters. Optimizing $\eta$ and $\varepsilon$ yields regret $O\left(\left(\frac{M^2\,TU\ln|\mathcal{S}|}{n}\right)^{1/3}\right) = o(1)$. Given $n$ as input, one can choose whichever of the two algorithms yields the smaller regret bound.
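To illustrate the bandit view, here is a minimal sketch of Exp3-style strategy selection. It assumes the standard exponential-weights update with uniform exploration rate `gamma`; the cost table, the value of `M`, and the function names are hypothetical, and `average_regret` computes the realized (not expected) value of (4.1) for one run.

```python
import math
import random


def exp3_select(costs, M, gamma=0.1, seed=0):
    """Choose one strategy per instance with Exp3.

    costs[i][s] is the cost of strategy s on instance i; only the cost of the
    strategy actually chosen in round i is used, as in the bandit setting.
    """
    rng = random.Random(seed)
    num_strategies = len(costs[0])
    weights = [1.0] * num_strategies
    chosen = []
    for row in costs:
        total = sum(weights)
        probs = [(1 - gamma) * w / total + gamma / num_strategies for w in weights]
        s = rng.choices(range(num_strategies), weights=probs)[0]
        chosen.append(s)
        reward = 1.0 - row[s] / M      # map a cost in [0, M] to a reward in [0, 1]
        estimate = reward / probs[s]   # importance-weighted reward estimate
        weights[s] *= math.exp(gamma * estimate / num_strategies)
        top = max(weights)
        weights = [w / top for w in weights]  # rescale to avoid overflow
    return chosen


def average_regret(costs, chosen):
    """Realized per-round regret of the chosen sequence against the best fixed strategy."""
    n = len(costs)
    incurred = sum(row[s] for row, s in zip(costs, chosen))
    best_fixed = min(sum(row[s] for row in costs) for s in range(len(costs[0])))
    return (incurred - best_fixed) / n


# Hypothetical costs: strategy 0 is always cheapest.
costs = [[1.0, 9.0, 5.0]] * 2000
chosen = exp3_select(costs, M=10.0)
```

On this toy sequence the selection concentrates on strategy 0 after an initial exploration phase.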


4.5  Experimental  Evaluation 

In  this  section  we  evaluate  query  strategy  S2  experimentally  by  using  it  to  create  modified 
versions  of  state-of-the-art  solvers  in  two  domains:  classical  A.I.  planning  and  job  shop 
scheduling.  In  both  of  these  domains,  we  found  that  the  number  of  standard  benchmark 
instances  was  too  small  for  the  online  algorithms  discussed  in  the  previous  section  to  be 
effective.  Accordingly,  our  experimental  evaluation  focuses  on  the  techniques  developed 
for  the  single-instance  setting. 

4.5.1  Planning 

The  planners  entered  in  the  2006  International  Planning  Competition  were  divided  into 
two  categories:  optimal  planners  always  return  a  plan  of  provably  minimum  makespan, 
whereas  satisficing  planners  simply  return  a  feasible  plan  quickly.  In  this  section  we  pursue 
a  different  goal:  obtaining  a  provably  near-optimal  plan  as  quickly  as  possible. 

As already mentioned, SATPLAN finds a minimum-makespan plan by making a sequence of calls to a SAT solver that answers questions of the form “Does there exist a plan of makespan $\le k$?” The original version of SATPLAN tries $k$ values in an increasing sequence starting from $k = 1$, stopping as soon as it obtains a “yes” answer. We compare the original version to a modified version that instead uses query strategy S2. When using S2 we do not share any work (e.g., intermediate result files) among queries with the same $k$ value, although doing so could improve performance.
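The ramp-up behavior of the original planner can be sketched in a few lines. This is an illustration only: `decide` stands in for a call to the SAT solver, and the toy oracle below is a hypothetical substitute that answers instantly.

```python
def ramp_up(decide, max_k):
    """Try k = 1, 2, ... and return the first k that receives a "yes" answer.

    decide(k) should return True when a plan of makespan <= k exists.
    Returns None if no k up to max_k receives a "yes".
    """
    for k in range(1, max_k + 1):
        if decide(k):
            return k
    return None


def make_toy_oracle(opt):
    """Hypothetical stand-in for the SAT solver: answers "yes" exactly when k >= opt."""
    return lambda k: k >= opt


makespan = ramp_up(make_toy_oracle(7), max_k=20)
```

Because every query with $k < OPT$ must be refuted before a "yes" is reached, this strategy yields no upper bound at all until it has found the optimum.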

We ran each of these two versions of SATPLAN on benchmark instances from the
2006  International  Planning  Competition,  with  a  one  hour  time  limit  per  instance,  and 
recorded  the  upper  and  lower  bounds  we  obtained.  To  obtain  an  initial  upper  bound,  we 
ran  the  satisficing  planner  SGPlan  [37]  with  a  one  minute  time  limit.  We  chose  SGPlan 
because  it  won  first  prize  in  the  satisficing  planning  track  of  last  year’s  competition.  If 
SGPlan  found  a  feasible  plan  within  the  one  minute  time  limit,  we  used  the  number  of 
actions  in  that  plan  as  an  upper  bound  on  the  optimum  makespan;  otherwise  we  artificially 
set  the  upper  bound  to  100. 


Table 4.1: Performance of two query strategies on benchmark instances from the pathways domain of the 2006 International Planning Competition. Bold numbers indicate the (strictly) best upper/lower bound we obtained.

Instance   SATPLAN (S2)     SATPLAN (Sg)     SATPLAN (original)
           [lower, upper]   [lower, upper]   [lower, upper]
p01        [5, 5]           [5, 5]           [5, 5]
p02        [7, 7]           [7, 7]           [7, 7]
p03        [8, 8]           [8, 8]           [8, 8]
p04        [8, 8]           [8, 8]           [8, 8]
p05        [9, 9]           [9, 9]           [9, 9]
p06        [12, 12]         [12, 12]         [12, 12]
p07        [13, 13]         [13, 13]         [13, 13]
p08        [15, 17]         [16, 17]         [16, ∞]
p09        [15, 17]         [15, 17]         [15, ∞]
p10        [15, 15]         [15, 15]         [15, 15]
p11        [16, 17]         [16, 17]         [16, ∞]
p12        [16, 19]         [17, 19]         [17, ∞]
p13        [16, 18]         [17, 18]         [17, ∞]
p14        [14, 20]         [15, 19]         [15, ∞]
p15        [18, 18]         [18, 18]         [18, 18]
p16        [17, 21]         [19, 22]         [19, ∞]
p17        [19, 21]         [20, 22]         [20, ∞]
p18        [19, 22]         [19, 23]         [19, ∞]
p19        [17, 22]         [18, 24]         [18, ∞]
p20        [17, 28]         [18, 27]         [19, ∞]
p21        [20, 25]         [21, 25]         [22, ∞]
p22        [17, 23]         [18, 26]         [19, ∞]
p23        [17, 25]         [17, 25]         [18, ∞]
p24        [21, 27]         [21, 28]         [22, ∞]
p25        [20, 27]         [20, ∞]          [21, ∞]
p26        [19, 27]         [20, 31]         [21, ∞]
p27        [19, 34]         [20, 31]         [20, ∞]
p28        [19, 27]         [20, ∞]          [21, ∞]
p29        [19, 29]         [18, 29]         [18, ∞]
p30        [20, 60]         [21, ∞]          [21, ∞]

Table 4.1 presents our results for 30 instances from the pathways domain. Numbers in bold indicate an upper or lower bound obtained by one query strategy that was strictly better than the bound obtained by any other query strategy. Not surprisingly, S2 always obtains upper bounds that are as good as or better than those obtained by the ramp-up strategy. Interestingly, the lower bounds obtained by S2 are only slightly worse, differing by at most two parallel steps from the lower bound obtained by the ramp-up strategy. Examining the ratio of the upper and lower bounds obtained by S2, we see that for 26 out of the 30 instances it finds a plan whose makespan is (provably) at most 1.5 times optimal, and for all but one instance it obtains a plan whose makespan is at most two times optimal. In contrast, the ramp-up strategy does not find a feasible plan for 21 of the 30 instances. Thus on the pathways domain, the modified version of SATPLAN using query strategy S2 gives behavior that is in many ways better than that of the original.

To better understand the performance of S2, we also compared it to a geometric query strategy Sg inspired by Algorithm B of Rintanen [69]. This query strategy behaves as follows. It initializes $T$ to 1. If $l$ and $u$ are the initial lower and upper bounds, it then executes the queries $(k, T\gamma^{k-l})$ for each $k \in \{l, l + 1, \ldots, u - 1\}$, where $\gamma \in (0, 1)$ is a parameter. It then updates $l$ and $u$, doubles $T$, and repeats. Based on the results of Rintanen [69] we set $\gamma = 0.8$. We do not compare to Rintanen's Algorithm B directly because it requires many runs of the SAT solver to be performed in parallel, which requires an impractically large amount of memory for some of the benchmark instances considered in our evaluation.
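The description of Sg above can be sketched as follows. This is a minimal illustration under stated assumptions: `query(k, limit)` is a hypothetical interface returning "yes", "no" or "timeout", the bounds are updated only after each full sweep (as in the description), and the toy oracle with its `tau` table is invented for the example.

```python
def geometric_strategy(query, l, u, gamma=0.8):
    """Sketch of the geometric strategy Sg; returns the final bounds (l, u).

    Terminates once l == u, which happens eventually whenever every query
    needs only finite time, because the time limit T doubles after each sweep.
    """
    T = 1.0
    while l < u:
        base = l
        # One sweep: query every k in [l, u - 1] with a geometrically decaying limit.
        answers = [(k, query(k, T * gamma ** (k - base))) for k in range(l, u)]
        for k, answer in answers:
            if answer == "yes":
                u = min(u, k)
            elif answer == "no":
                l = max(l, k + 1)
        T *= 2.0
    return l, u


def make_toy_query(opt, tau):
    """Hypothetical decision procedure: needs tau[k] time units to answer query k."""
    def query(k, limit):
        if tau[k] > limit:
            return "timeout"
        return "yes" if k >= opt else "no"
    return query


toy_query = make_toy_query(opt=3, tau={1: 1.0, 2: 2.0, 3: 4.0, 4: 1.0})
bounds = geometric_strategy(toy_query, l=1, u=5, gamma=0.5)
```

With these toy values the lower bound rises to 2 and then 3 in the first two sweeps, and the third sweep (with $T = 4$) answers both remaining queries, giving bounds (3, 3).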

The results for Sg are shown in the second column of Table 4.1. Like S2, Sg always obtains upper bounds that are as good as or better than those of the ramp-up strategy. Compared to S2, Sg generally obtains slightly better lower bounds and slightly worse upper bounds. Unlike S2, Sg does not obtain any non-trivial upper bound for three of the 30 instances.

Similar tables for the remaining six problem domains are available online at http://www.cs.cmu.edu/~matts/icaps07/appendixA.pdf. For the storage, rovers, and trucks domains, our results are similar to the ones presented in Table 4.1: S2 achieved significantly better upper bounds than ramp-up and slightly worse lower bounds, while Sg achieved slightly better lower bounds than S2 and slightly worse upper bounds. For the openstacks, TPP, and pipesworld domains, our results were qualitatively different: most instances in these domains were either easy enough that all three query strategies found a provably optimal plan, or so difficult that no strategy found a feasible plan, with the ramp-up strategy yielding the best lower bounds.

To gain more insight into these results, we plotted the function $\tau(k)$ for various instances. Broadly speaking, we encountered two types of behavior: either $\tau(k)$ increased as a function of $k$ for $k < OPT$ but decreased as a function of $k$ for $k > OPT$, or $\tau(k)$ increased as a function of $k$ for all $k$. Figure 4.3 (A) and (B) give prototypical examples of these two behaviors. The gross behavior of $\tau$ on a particular instance was largely determined by the problem domain. For instances from the pathways, storage, trucks, and rovers domains $\tau$ tended to be increasing-then-decreasing, while for instances from the TPP and pipesworld domains $\tau$ tended to be monotonically increasing, explaining the qualitative difference between our results in these two sets of domains. For most instances in the openstacks domain we found no $k$ values that elicited a “yes” answer in reasonable time; hence we cannot characterize the typical behavior of $\tau$.


4.5.2  Job  shop  scheduling 

In this section, we use query strategy S2 to create a modified version of a branch and bound algorithm for job shop scheduling. We chose the algorithm of Brucker et al. [15] (henceforth referred to as Brucker) because it is one of the state-of-the-art branch and bound algorithms for job shop scheduling, and because code for it is freely available online.

Figure 4.3: Behavior of the SAT solver siege running on formulae generated by SATPLAN to solve (A) instance p7 from the trucks domain and (B) instance p21 from the pipesworld domain of the 2006 International Planning Competition. (Panels: (A) trucks/p7.pddl, (B) pipesworld/p21.pddl; horizontal axis: makespan bound $k$.)

Given a branch and bound algorithm, one can always create a decision procedure that answers the question “Does there exist a solution with cost at most $k$?” as follows: initialize the global upper bound to $k + 1$ (here we are assuming the objective function is integer-valued), and run the algorithm until either a solution with cost $\le k$ is discovered (in which case the result of the query is “yes”) or the algorithm terminates without finding such a solution (in which case the result is “no”). Note that the decision procedure returns the correct answer independent of whether $k + 1$ is a valid upper bound. A query strategy can be used in conjunction with this decision procedure to find optimal or approximately optimal solutions to the original minimization problem.
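The construction of a decision procedure from a branch and bound algorithm can be illustrated on a toy problem. The sketch below is not Brucker's algorithm: it assumes a small assignment problem with non-negative integer costs, uses the partial cost as the lower bound for pruning, and pairs the decision procedure with plain binary search rather than S2.

```python
def exists_solution_with_cost_at_most(cost, k):
    """Decision procedure built from depth-first branch and bound.

    cost[row][col] is the (non-negative integer) cost of giving job `col` to
    worker `row`. The global upper bound is set speculatively to k + 1, and the
    search stops at the first complete assignment it reaches.
    """
    n = len(cost)
    bound = k + 1  # speculative upper bound; need not be valid

    def search(row, used, partial):
        if partial >= bound:  # prune: this branch cannot yield cost <= k
            return False
        if row == n:
            return True       # complete assignment with cost <= k: answer "yes"
        return any(
            search(row + 1, used | {col}, partial + cost[row][col])
            for col in range(n)
            if col not in used
        )

    return search(0, frozenset(), 0)


def minimize(cost):
    """Find the optimal cost by binary search over k using the decision procedure."""
    lo, hi = 0, sum(max(row) for row in cost)  # the optimum lies in [lo, hi]
    while lo < hi:
        k = (lo + hi) // 2
        if exists_solution_with_cost_at_most(cost, k):
            hi = k
        else:
            lo = k + 1
    return lo


cost = [[4, 2, 8], [4, 3, 7], [3, 1, 6]]
optimum = minimize(cost)
```

The answer for a given $k$ does not depend on whether a solution of cost $k + 1$ actually exists, which mirrors the remark above about speculative upper bounds.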

We  evaluate  two  versions  of  Brucker:  the  original  and  a  modified  version  that  uses 
S2.  We  ran  both  versions  on  the  instances  in  the  OR  library  [10]  with  a  one  hour  time 
limit  per  instance,  and  recorded  the  upper  and  lower  bounds  obtained.  We  do  not  evaluate 
the  ramp-up  strategy  or  Sg  in  this  context,  because  they  were  not  intended  to  work  well 
on  problems  such  as  job  shop  scheduling,  where  the  number  of  possible  k  values  is  very 
large. 

On  50  of  the  benchmark  instances,  both  query  strategies  found  a  (provably)  optimal 
solution  within  the  time  limit.  Table  4.2  presents  the  results  for  the  remaining  instances. 
As  in  Table  4.1,  bold  numbers  indicate  an  upper  or  lower  bound  that  was  strictly  better 
than  the  one  obtained  by  the  competing  algorithm. 

With the exception of just one instance (la25), the modified algorithm using query strategy S2 obtains lower bounds that are as good as or better than those of the original branch and bound algorithm. This is not surprising, because the lower bound obtained by running the original branch and bound algorithm is simply the value obtained by solving the relaxed subproblem at the root node of the search tree, and is not updated as the search progresses. What is more surprising is that the upper bounds obtained by S2 are also, in the majority of cases, substantially better than those obtained by the original algorithm. This indicates that the speculative upper bounds created by S2's queries are effective in pruning away irrelevant regions of the search space and forcing the branch and bound algorithm to find low-cost schedules more quickly. These results are especially promising given that the technique used to obtain them is domain-independent and could be applied to other branch and bound algorithms. In related work, Streeter & Smith [81] improved the performance of Brucker by using an iterated local search algorithm for job shop scheduling to obtain valid upper bounds and also to refine the branch ordering heuristic.

Table 4.2: Performance of two query strategies on benchmark instances from the OR library. Bold numbers indicate the (strictly) best upper/lower bound we obtained.

Instance   Brucker (S2)      Brucker (original)
           [lower, upper]    [lower, upper]
abz7       [650, 712]        [650, 726]
abz8       [622, 725]        [597, 767]
abz9       [644, 728]        [616, 820]
ft20       [1165, 1165]      [1164, 1179]
la21       [1038, 1070]      [995, 1057]
la25       [971, 979]        [977, 977]
la26       [1218, 1227]      [1218, 1218]
la27       [1235, 1270]      [1235, 1270]
la28       [1216, 1221]      [1216, 1273]
la29       [1118, 1228]      [1114, 1202]
la38       [1176, 1232]      [1077, 1228]
la40       [1211, 1243]      [1170, 1226]
swv01      [1391, 1531]      [1366, 1588]
swv02      [1475, 1479]      [1475, 1719]
swv03      [1373, 1629]      [1328, 1617]
swv04      [1410, 1632]      [1393, 1734]
swv05      [1414, 1554]      [1411, 1733]
swv06      [1572, 1943]      [1513, 2043]
swv07      [1432, 1877]      [1394, 1932]
swv08      [1614, 2120]      [1586, 2307]
swv09      [1594, 1899]      [1594, 2013]
swv10      [1603, 2096]      [1560, 2104]
swv11      [2983, 3407]      [2983, 3731]
swv12      [2971, 3455]      [2955, 3565]
swv13      [3104, 3503]      [3104, 3893]
swv14      [2968, 3350]      [2968, 3487]
swv15      [2885, 3279]      [2885, 3583]
yn1        [813, 987]        [763, 992]
yn2        [835, 1004]       [795, 1037]
yn3        [812, 982]        [793, 1013]
yn4        [899, 1158]       [871, 1178]

Figure 4.4: Behavior of Brucker running on OR library instance ft10.

To better understand these results, we manually examined the function $\tau(k)$ for a number of instances from the OR library. In all cases, we found that $\tau(k)$ increased smoothly up to a point and then rapidly decreased in a jagged fashion. Figure 4.4 illustrates this behavior. The smooth increase of $\tau(k)$ as a function of $k$ for $k < OPT$ reflects the fact that proving that no schedule of makespan $\le k$ exists becomes more difficult as $k$ gets closer to $OPT$. The jaggedness of $\tau(k)$ for $k > OPT$ can be seen as an interaction between two factors: for $k > OPT$, increasing $k$ leads to less pruning (increasing $\tau(k)$) but also to a weaker termination criterion (reducing it). In spite of this, the curve has low stretch overall, and thus its shape can be exploited by query strategies such as S2.
4.6 Conclusions


Optimization  problems  are  often  solved  using  an  algorithm  for  the  corresponding  decision 
problem  as  a  subroutine.  In  this  chapter,  we  considered  the  problem  of  choosing  which 
queries  to  submit  to  the  decision  procedure  so  as  to  obtain  an  (approximately)  optimal 
solution  as  quickly  as  possible.  Our  main  contribution  was  a  new  query  strategy  S2  that  has 
attractive  theoretical  guarantees  and  appears  to  perform  well  in  practice.  Experimentally, 
we  showed  that  S2  can  be  used  to  create  improved  versions  of  state-of-the-art  algorithms 
for  classical  A.I.  planning  and  job  shop  scheduling.  Given  the  success  of  our  experiments 
with  a  branch  and  bound  algorithm  for  job  shop  scheduling,  an  interesting  direction  for 
future  work  would  be  to  apply  S2  in  other  domains  where  branch  and  bound  algorithms 
work  well,  for  example  integer  programming  or  resource-constrained  project  scheduling. 
Chapter 5

The Max k-Armed Bandit Problem

5.1  Introduction 


In the classical k-armed bandit problem one is faced with a set of k slot machines, each
of  which  has  an  arm  that,  when  pulled,  yields  a  payoff  drawn  independently  at  random 
from  a  fixed  (but  unknown)  distribution.  The  goal  is  to  allocate  trials  to  the  arms  so  as 
to  maximize  the  cumulative  payoff  received  over  a  series  of  n  trials.  Solving  the  problem 
entails  striking  a  balance  between  exploration  (determining  which  arm  yields  the  highest 
mean  payoff)  and  exploitation  (repeatedly  pulling  this  arm). 

In the max variant of the k-armed bandit problem, the goal is to maximize the maximum (rather than cumulative) payoff. This version of the problem arises in practice when tackling combinatorial optimization problems for which a number of randomized search heuristics exist: given k heuristics, each yielding a stochastic outcome when applied to some particular problem instance, we wish to allocate trials to the heuristics so as to maximize the maximum payoff (e.g., the maximum number of clauses satisfied by any sampled variable assignment, the maximum quality of any sampled schedule). Cicirello and Smith [21] show that a max k-armed bandit approach yields good performance on the resource-constrained project scheduling problem with maximal time lags (RCPSP/max), a difficult real-world scheduling problem.

Formally, an instance $I = (G_1, G_2, \ldots, G_k)$ of the max k-armed bandit problem is a k-tuple of probability distributions, each thought of as an arm on a slot machine. The $i$th arm, when pulled, returns a payoff drawn independently at random from distribution $G_i$. A strategy $S$ is a rule for determining which arm to pull next, as a function of the results of previous pulls. For any strategy $S$, instance $I$, and positive integer $n$, let $M(I, S, n)$ denote the maximum payoff obtained by following strategy $S$ on instance $I$ for $n$ pulls. The regret of strategy $S$ on instance $I$ after $n$ pulls is equal to

$$\max_{1 \le i \le k} \mathbb{E}\left[M_n^i\right] - \mathbb{E}\left[M(I, S, n)\right] \qquad (5.1)$$

where $M_n^i$ denotes the maximum of $n$ independent draws from $G_i$ (i.e., the maximum payoff obtained by pulling arm $i$ every time). Note that, in contrast to the classical k-armed bandit problem (where the goal is to maximize cumulative payoff), the optimal strategy for a particular instance $I$ does not necessarily consist of pulling a single arm for all $n$ pulls.1 Thus, it is possible for a strategy to have negative regret on some instances.

The worst-case regret of strategy $S$ is the maximum value of (5.1) over all instances, viewed as a function of $k$ and $n$. We say that $S$ is a no-regret strategy if, for any fixed $k$, the worst-case regret is $o(1)$ as a function of $n$. Note that this is stronger than simply requiring that, for any particular instance, the regret approaches zero as $n \to \infty$. Indeed, as long as payoffs are bounded then simply sampling the arms in round-robin order meets the latter requirement. However, as Theorem 29 in the next section shows, round-robin sampling is not a no-regret strategy.

In this chapter, we present two strategies for the max k-armed bandit problem. We first discuss a simple strategy called Threshold Ascent which is designed to work well for a wide variety of payoff distributions encountered in practice (Theorem 29 shows that no strategy can be expected to work well for all payoff distributions). We then discuss a second strategy that has strong theoretical guarantees in the special case where each payoff distribution is a generalized extreme value (GEV) distribution (defined in §5.3.1). The motivation for studying this special case is the Extremal Types Theorem [23], which singles out the GEV as the limiting distribution of the maximum of a large number of independent identically distributed (i.i.d.) random variables. Roughly speaking, one can think of the Extremal Types Theorem as an analogue of the Central Limit Theorem. Just as the Central Limit Theorem states that the sum of a large number of i.i.d. random variables converges in distribution to a Gaussian, the Extremal Types Theorem states that the maximum of a large number of i.i.d. random variables converges in distribution to a GEV.

The  results  presented  in  this  chapter  are  based  on  two  conference  papers  [80,  82]. 


'As  an  example,  suppose  there  are  two  arms.  Arm  A  always  returns  payoff  while  arm  B  returns  payoff 
1  with  probability  0.01  and  payoff  zero  otherwise.  For  large  n,  the  optimal  strategy  is  to  pull  arm  A  once 
and  then  pull  arm  B  the  remaining  n  —  1  times. 
5.1.1 Related work


The classical k-armed bandit problem was first studied by Robbins [70] and has since been the subject of numerous papers; see Berry and Fristedt [11] and Kaelbling [40] for overviews.

The max variant of the k-armed bandit problem was introduced by Cicirello and Smith [19, 21], whose experiments with randomized priority dispatching rules for the RCPSP/max form the basis of our experimental evaluation in §5.4. Cicirello and Smith show that their max k-armed bandit strategy yields performance on the RCPSP/max that is competitive with the state of the art. The design of Cicirello and Smith's strategy is motivated by an analysis of the special case in which each arm's payoff distribution is a GEV distribution with shape parameter $\xi = 0$.

5.1.2  A  negative  result  for  arbitrary  payoff  distributions 

Ideally, we would like to come up with a no-regret strategy for the max k-armed bandit
problem  that  requires  as  few  distributional  assumptions  as  possible.  As  a  first  step,  it  seems 
reasonable  to  require  that  all  payoffs  come  from  a  bounded  interval  (this  will  be  true  in  all 
the  applications  we  intend  to  consider).  In  fact,  a  no-regret  strategy  does  not  exist  even 
under  the  stronger  assumption  that  all  payoffs  are  either  0  or  1,  as  the  following  theorem 
shows. 

Theorem 29. For any max k-armed bandit strategy $S$ and any positive integer $n$, there exists a max k-armed bandit instance on which $S$ has regret at least $1 - \frac{1}{e} - \frac{1}{k}$ after $n$ pulls.

Proof. Let $I_j = (G_1, G_2, \ldots, G_k)$ denote a max k-armed bandit instance in which distribution $G_i$ always returns payoff 0 for $i \ne j$, while distribution $G_j$ returns payoff 1 with probability $\frac{1}{n}$ and returns payoff 0 otherwise. Thus, pulling arm $G_j$ for all $n$ pulls yields expected maximum payoff

$$1 - \left(1 - \frac{1}{n}\right)^{n} \ge 1 - \frac{1}{e}.$$

It suffices to show that for some choice of $j$, $S$ receives expected maximum payoff at most $\frac{1}{k}$. To see this, assume that the behavior of $S$ is unaffected by the payoffs it receives. This is without loss of generality because all payoffs are in $\{0, 1\}$, and once $S$ receives a payoff of 1 its subsequent choices have no effect on the maximum payoff that it receives. Under this assumption, there must be some arm $j$ whose expected number of pulls is $\le \frac{n}{k}$. Thus on instance $I_j$, the expected total payoff that $S$ receives is $\le \frac{n}{k} \cdot \frac{1}{n} = \frac{1}{k}$, which implies that the expected maximum payoff is $\le \frac{1}{k}$. □
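The bound can be checked numerically for a payoff-oblivious strategy such as round-robin sampling. The sketch below assumes the instance $I_j$ from the proof (arm $j$ pays 1 with probability $1/n$, all other arms pay 0) and computes regret (5.1) exactly; the values of `n` and `k` are arbitrary choices for illustration.

```python
import math


def hard_instance_regret(n, pulls_of_arm_j):
    """Exact regret (5.1) on instance I_j for a strategy that pulls arm j a fixed
    number of times, regardless of the payoffs it observes."""
    p = 1.0 / n
    best_single_arm = 1.0 - (1.0 - p) ** n           # pull arm j all n times
    achieved = 1.0 - (1.0 - p) ** pulls_of_arm_j     # P(at least one payoff of 1)
    return best_single_arm - achieved


n, k = 1000, 10
# Round-robin sampling gives each arm n / k pulls.
regret = hard_instance_regret(n, pulls_of_arm_j=n // k)
lower_bound = 1.0 - 1.0 / math.e - 1.0 / k
```

For these values the regret is roughly 0.537, slightly above the guaranteed lower bound of roughly 0.532, and it does not shrink as $n$ grows.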


5.2  A  Simple,  Distribution-Free  Approach 

In this section, we do not assume that the payoff distributions belong to any specific parametric family. In fact, we will not make any formal assumptions at all about the payoff distributions, although (as Theorem 29 shows) our approach cannot be expected to work well if the distributions are chosen adversarially. Roughly speaking, our approach will work best when the following two criteria are satisfied.

1. There is a (relatively low) threshold $t_{critical}$ such that, for all $t > t_{critical}$, the arm that is most likely to yield a payoff $> t$ is the same as the arm most likely to yield a payoff $> t_{critical}$. Call this arm $i^*$.

2. As $t$ increases beyond $t_{critical}$, there is a growing gap between the probability that arm $i^*$ yields a payoff $> t$ and the corresponding probability for other arms. Specifically, if we let $p_i(t)$ denote the probability that the $i$th arm returns a payoff $> t$, the ratio $p_{i^*}(t) / p_i(t)$ should increase as a function of $t$ for $t > t_{critical}$, for any $i \ne i^*$.

Figure  5.1  illustrates  a  set  of  two  payoff  distributions  that  satisfy  these  assumptions. 
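The second criterion can be checked numerically for a given pair of distributions. The sketch below assumes, purely for illustration, two exponential payoff distributions (which are unbounded and are not taken from the experiments); the function names and the grid of thresholds are hypothetical.

```python
import math


def tail_probability(rate, t):
    """P(payoff > t) for an exponential payoff distribution with the given rate."""
    return math.exp(-rate * t)


def ratio_is_increasing(rate_best, rate_other, thresholds):
    """Check that p_best(t) / p_other(t) strictly increases along `thresholds`."""
    ratios = [
        tail_probability(rate_best, t) / tail_probability(rate_other, t)
        for t in thresholds
    ]
    return all(a < b for a, b in zip(ratios, ratios[1:]))


thresholds = [0.5 * j for j in range(1, 11)]
heavier_tail_wins = ratio_is_increasing(1.0, 2.0, thresholds)
```

Here the ratio equals $e^{t}$, so the arm with the heavier tail pulls further ahead as the threshold rises.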

In this section we present a new algorithm, Chernoff Interval Estimation, for the classical k-armed bandit problem and prove a bound on its regret. Our algorithm is simple and has performance guarantees competitive with the state of the art. Building on Chernoff Interval Estimation, we develop a new algorithm, Threshold Ascent, for solving the max k-armed bandit problem. Our algorithm is designed to work well as long as the two mild distributional assumptions just described are satisfied. In §5.4 we evaluate Threshold Ascent experimentally by using it to select among randomized priority dispatching rules for the RCPSP/max. We find that Threshold Ascent (i) performs better than any of the priority rules perform in isolation, and (ii) outperforms the recent QD-BEACON max k-armed bandit algorithm of Cicirello and Smith [19, 21].

5.2.1  Chernoff  Interval  Estimation 

In this section we present and analyze a simple algorithm, Chernoff Interval Estimation,
for the classical k-armed bandit problem. In §5.2.2 we use this algorithm as a subroutine in
Threshold Ascent, an algorithm for the max k-armed bandit problem.

Figure 5.1: A max k-armed bandit instance on which Threshold Ascent should perform
well.


In the classical k-armed bandit problem one is faced with a set of k arms. The $i$th arm,
when pulled, returns a payoff drawn independently at random from a fixed (but unknown)
distribution. All payoffs are real numbers between 0 and 1. We denote by $\mu_i$ the expected
payoff obtained from a single pull of arm $i$, and define $\mu^* = \max_{1 \le i \le k} \mu_i$. We consider
the finite-time version of the problem, in which our goal is to maximize the cumulative
payoff received using a fixed budget of $n$ pulls. The regret of an algorithm (on a particular
instance of the classical k-armed bandit problem) is the difference between $\mu^* n$ (the
expected cumulative payoff the algorithm would have received by pulling the single best
arm $n$ times) and the expected cumulative payoff the algorithm receives on the instance.

Chernoff Interval Estimation is simply the well-known interval estimation algorithm
[40, 57] with confidence intervals derived using Chernoff's inequality. Although various
interval estimation algorithms have been analyzed in the literature and a variety of guarantees
have been proved, both (i) our use of Chernoff's inequality in an interval estimation
algorithm and (ii) our analysis appear to be novel. In particular, when the mean payoff
returned by each arm is small (relative to the maximum possible payoff) our algorithm has
much better performance than the recent algorithm of Auer et al. [4], which is identical
to our algorithm except that confidence intervals are derived using Hoeffding's inequality.
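We give further discussion of related work later in this section.

Classical k-armed bandit strategy Chernoff Interval Estimation

Input: positive integer $n$, real number $\delta \in (0, 1)$.

1. Initialize $x_i \leftarrow 0$, $n_i \leftarrow 0$ $\forall i \in \{1, 2, \ldots, k\}$.

2. Repeat $n$ times:

(a) $\hat{\imath} \leftarrow \arg\max_i U(\bar{\mu}_i, n_i)$, where $\bar{\mu}_i = \frac{x_i}{n_i}$ and

$$U(\mu, n_0) = \begin{cases} \mu + \dfrac{\alpha + \sqrt{2 n_0 \mu \alpha + \alpha^2}}{n_0} & \text{if } n_0 > 0 \\ \infty & \text{otherwise} \end{cases}$$

where $\alpha = \ln\left(\frac{2nk}{\delta}\right)$.

(b) Pull arm $\hat{\imath}$, receive payoff $R$, set $x_{\hat{\imath}} \leftarrow x_{\hat{\imath}} + R$, and set $n_{\hat{\imath}} \leftarrow n_{\hat{\imath}} + 1$.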
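The strategy above can be sketched in Python as follows. This is only an illustrative sketch, not the implementation used in the experiments: the function names, the representation of arms as zero-argument callables, and the tie-breaking rule (lowest index wins) are assumptions made here. The index uses $U(\mu, n_0) = \mu + (\alpha + \sqrt{2 n_0 \mu \alpha + \alpha^2})/n_0$, the same form that appears in the Threshold Ascent code in §5.2.2.

```python
import math
import random


def upper_bound(mu_hat, n0, alpha):
    """Chernoff-based upper confidence bound U(mu_hat, n0)."""
    if n0 == 0:
        return math.inf  # an arm that was never pulled is tried first
    return mu_hat + (alpha + math.sqrt(2 * n0 * mu_hat * alpha + alpha ** 2)) / n0


def chernoff_interval_estimation(arms, n, delta):
    """Pull arms for n rounds; each arm is a zero-argument callable
    returning a payoff in [0, 1]. Returns (total payoff, pulls per arm)."""
    k = len(arms)
    alpha = math.log(2 * n * k / delta)
    x = [0.0] * k      # x[i]: cumulative payoff of arm i
    pulls = [0] * k    # pulls[i]: number of times arm i was pulled

    def score(i):
        mu_hat = x[i] / pulls[i] if pulls[i] > 0 else 0.0
        return upper_bound(mu_hat, pulls[i], alpha)

    for _ in range(n):
        best = max(range(k), key=score)  # ties go to the lowest index (assumption)
        payoff = arms[best]()
        x[best] += payoff
        pulls[best] += 1
    return sum(x), pulls


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical Bernoulli arms with small means, for illustration only.
    means = [0.02, 0.05, 0.10]
    arms = [lambda p=p: 1.0 if rng.random() < p else 0.0 for p in means]
    total, pulls = chernoff_interval_estimation(arms, n=20000, delta=0.01)
    print(total, pulls)
```

Recomputing every index in each round costs O(k) per pull, which keeps the sketch close to the pseudocode; a priority queue could reduce this but is not needed to illustrate the idea.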


We now bound the expected regret of Chernoff Interval Estimation. Our analysis proceeds
as follows. Lemma 13 shows that (with a certain minimum probability) the value
$U(\bar{\mu}_i, n_i)$ is always an upper bound on $\mu_i$. Lemma 14 then places a bound on the number
of times the algorithm will sample an arm whose mean payoff is suboptimal. Theorem 30
puts these results together to obtain a bound on Chernoff Interval Estimation's worst-case
regret.


We  will  make  use  of  the  following  well-known  Chernoff  bound,  which  we  simply  refer 
to  as  “Chernoff’s  inequality”. 

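Chernoff's inequality. Let $X = \sum_{i=1}^{n} X_i$ be the sum of $n$ independent identically
distributed random variables with $X_i \in [0, 1]$ and $\mu = E[X_i]$. Then for $\beta > 0$,

$$P\left[\frac{X}{n} < (1 - \beta)\mu\right] < \exp\left(-\frac{n \mu \beta^2}{2}\right)$$

and

$$P\left[\frac{X}{n} > (1 + \beta)\mu\right] < \exp\left(-\frac{n \mu \beta^2}{3}\right).$$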


We  will  also  use  the  following  algebraic  fact,  which  holds  by  construction. 

Fact 3. If $z = U(\mu, n_0)$ then

$$z n_0 \left(1 - \frac{\mu}{z}\right)^2 = 2\alpha.$$

Lemma 13. During a run of Chernoff Interval Estimation($n$, $\delta$) it holds with probability
at least $1 - \frac{\delta}{2}$ that for all arms $i \in \{1, 2, \ldots, k\}$ and for all $n$ repetitions of the loop,

$$U(\bar{\mu}_i, n_i) \ge \mu_i.$$


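Proof. It suffices to show that for any arm $i$ and any particular repetition of the loop,
$P[U(\bar{\mu}_i, n_i) < \mu_i] < \frac{\delta}{2nk}$. Consider some particular fixed values of $\mu_i$, $\alpha$, and $n_i$, and let
$\mu_c$ be the largest solution to the equation

$$U(\mu_c, n_i) = \mu_i. \tag{5.2}$$

By inspection, $U(\mu_c, n_i)$ is strictly increasing as a function of $\mu_c$. Thus $U(\bar{\mu}_i, n_i) < \mu_i$ if
and only if $\bar{\mu}_i < \mu_c$. Therefore

$$\begin{aligned}
P[U(\bar{\mu}_i, n_i) < \mu_i] &= P[\bar{\mu}_i < \mu_c] \\
&= P\left[\bar{\mu}_i < \left(1 - \left(1 - \frac{\mu_c}{\mu_i}\right)\right)\mu_i\right] \\
&< \exp\left(-\frac{n_i \mu_i}{2}\left(1 - \frac{\mu_c}{\mu_i}\right)^2\right) && \text{(Chernoff's inequality)} \\
&= \exp(-\alpha) && \text{(Fact 3 and equation 5.2)} \\
&= \frac{\delta}{2nk}.
\end{aligned}$$

□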
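Fact 3 is used in the last step of the proof above. As a sketch of why it holds, take the definition $U(\mu, n_0) = \mu + (\alpha + \sqrt{2 n_0 \mu \alpha + \alpha^2})/n_0$ for $n_0 > 0$ (the same form that appears in the Threshold Ascent code in §5.2.2), let $z = U(\mu, n_0)$, and write $d = z - \mu$ (a symbol introduced here only for this derivation):

```latex
\begin{align*}
n_0 d - \alpha &= \sqrt{2 n_0 \mu \alpha + \alpha^2}
  && \text{(definition of } U\text{)} \\
n_0^2 d^2 - 2 n_0 d \alpha &= 2 n_0 \mu \alpha
  && \text{(square both sides, cancel } \alpha^2\text{)} \\
n_0 d^2 &= 2\alpha (d + \mu) = 2\alpha z
  && \text{(divide by } n_0\text{)} \\
z n_0 \left(1 - \frac{\mu}{z}\right)^2 = \frac{n_0 d^2}{z} &= 2\alpha .
\end{align*}
```

Applied with $z = \mu_i$, $\mu = \mu_c$, and $n_0 = n_i$, this gives $\frac{n_i \mu_i}{2}\left(1 - \frac{\mu_c}{\mu_i}\right)^2 = \alpha$, which is the substitution made in the step labeled "Fact 3 and equation 5.2".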

Lemma 14. During a run of Chernoff Interval Estimation($n$, $\delta$) it holds with probability
at least $1 - \delta$ that each suboptimal arm $i$ (i.e., each arm $i$ with $\mu_i < \mu^*$) is pulled at most
$\frac{3\alpha}{\mu^*} \frac{1}{(1 - \sqrt{y_i})^2}$ times, where $y_i = \frac{\mu_i}{\mu^*}$.


Proof. Let $i^*$ be some optimal arm (i.e., $\mu_{i^*} = \mu^*$) and assume that $U(\bar{\mu}_{i^*}, n_{i^*}) \ge \mu^*$
for all $n$ repetitions of the loop. By Lemma 13, this assumption is valid with probability
at least $1 - \frac{\delta}{2}$. Consider some particular suboptimal arm $i$. By inspection, we will stop
sampling arm $i$ once $U(\bar{\mu}_i, n_i) < \mu^*$. So it suffices to show that if

$$n_i \ge \frac{3\alpha}{\mu^*} \frac{1}{(1 - \sqrt{y_i})^2} \tag{5.3}$$

then $U(\bar{\mu}_i, n_i) < \mu^*$ with probability at least $1 - \frac{\delta}{2k}$ (then the probability that any of our
assumptions fail is at most $\frac{\delta}{2} + k \frac{\delta}{2k} = \delta$).

To show this, it suffices to show that (5.3) implies two things. First, with probability at
least $1 - \frac{\delta}{2k}$ we have $\bar{\mu}_i < \sqrt{y_i}^{-1} \mu_i$. Second, if $\bar{\mu}_i < \sqrt{y_i}^{-1} \mu_i$ then $U(\bar{\mu}_i, n_i) < \mu^*$.

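We first show that (5.3) implies that with probability at least $1 - \frac{\delta}{2k}$, $\bar{\mu}_i < \sqrt{y_i}^{-1} \mu_i$.
This is true because

$$\begin{aligned}
P\left[\bar{\mu}_i \ge \sqrt{y_i}^{-1} \mu_i\right] &= P\left[\bar{\mu}_i \ge \left(1 + \frac{1 - \sqrt{y_i}}{\sqrt{y_i}}\right)\mu_i\right] \\
&< \exp\left(-\frac{n_i \mu_i}{3}\left(\frac{1 - \sqrt{y_i}}{\sqrt{y_i}}\right)^2\right) && \text{(Chernoff's inequality)} \\
&= \exp\left(-\frac{n_i \mu^*}{3}\left(1 - \sqrt{y_i}\right)^2\right) \\
&\le \exp(-\alpha) && \text{(equation 5.3)} \\
&= \frac{\delta}{2nk} \le \frac{\delta}{2k}.
\end{aligned}$$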


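To complete the proof, we show that (5.3) implies that if $\bar{\mu}_i < \sqrt{y_i}^{-1} \mu_i$, then $U(\bar{\mu}_i, n_i) < \mu^*$.
To see this, let $U_i = U(\bar{\mu}_i, n_i)$, and suppose for contradiction that $U_i \ge \mu^*$. By Fact 3,

$$n_i = \frac{2\alpha}{U_i}\left(1 - \frac{\bar{\mu}_i}{U_i}\right)^{-2}.$$

The right hand side increases as a function of $\bar{\mu}_i$ (assuming $\bar{\mu}_i < U_i$, which is true by
definition). So if $\bar{\mu}_i < \sqrt{y_i}^{-1} \mu_i$ then replacing $\bar{\mu}_i$ with $\sqrt{y_i}^{-1} \mu_i$ only increases the value
of the right hand side. Similarly, the right hand side decreases as a function of $U_i$, so if
$U_i \ge \mu^*$ then replacing $U_i$ with $\mu^*$ only increases the value of the right hand
side. Thus

$$n_i < \frac{2\alpha}{\mu^*}\left(1 - \frac{\sqrt{y_i}^{-1} \mu_i}{\mu^*}\right)^{-2} = \frac{2\alpha}{\mu^*}\left(1 - \sqrt{y_i}\right)^{-2},$$

which contradicts (5.3).

□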


The following theorem shows that when $n$ is large (and the parameter $\delta$ is small), the
total payoff obtained by Chernoff Interval Estimation over $n$ trials is almost as high as
what would be obtained by pulling the single best arm for all $n$ trials.

Theorem 30. The regret incurred by Chernoff Interval Estimation($n$, $\delta$) is at most

$$2\sqrt{3 \mu^* n (k - 1) \alpha} + \delta \mu^* n$$

where $\alpha = \ln\left(\frac{2nk}{\delta}\right)$.

Proof. First, assume that $k \ge 2$ (for $k = 1$ the theorem holds trivially).

The conclusion of Lemma 14 fails to hold with probability at most $\delta$. Because regret
cannot exceed $\mu^* n$, this scenario contributes at most $\delta \mu^* n$ to overall regret. Thus it remains
to show that, conditioned on the event that the conclusion of Lemma 14 holds, regret is at
most $2\sqrt{3 \mu^* n (k - 1) \alpha}$.

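Consider some arm $i$ with $\mu_i < \mu^*$. Let $y = \frac{\mu_i}{\mu^*}$ and let $n_i$ be the number of times arm
$i$ is pulled. By Lemma 14, $n_i \le \frac{3\alpha}{\mu^*} \frac{1}{(1 - \sqrt{y})^2}$. Each pull of arm $i$ adds $\mu^* - \mu_i = \mu^*(1 - y)$
to the regret. Thus, letting $R_i$ be the total regret incurred due to pulling arm $i$, we have

$$R_i \le \min\left\{ n_i \mu^* (1 - y),\; \frac{3\alpha (1 - y)}{(1 - \sqrt{y})^2} \right\}.$$

Using the fact that $y < 1$, we have

$$\frac{1 - y}{(1 - \sqrt{y})^2} = \frac{1 - y}{(1 - \sqrt{y})^2} \cdot \frac{(1 + \sqrt{y})^2}{(1 + \sqrt{y})^2} = \frac{(1 + \sqrt{y})^2}{1 - y} < \frac{4}{1 - y}.$$

Thus

$$R_i \le \min\left\{ n_i \mu^* (1 - y),\; \frac{12\alpha}{1 - y} \right\}.$$

For any fixed $n_i$, this expression is maximized when $1 - y = 2\sqrt{\frac{3\alpha}{n_i \mu^*}}$. Thus $R_i \le 2\sqrt{3 n_i \mu^* \alpha}$.

Assume without loss of generality that arm 1 is optimal (i.e., $\mu_1 = \mu^*$). Then the total
regret is at most $\sum_{i=2}^{k} R_i \le \sum_{i=2}^{k} 2\sqrt{3 n_i \mu^* \alpha}$. Subject to the constraint $\sum_{i=2}^{k} n_i \le n$, this
expression is maximized when $n_i = \frac{n}{k-1}$ for all $i$ ($2 \le i \le k$). Thus the total regret is at
most $2\sqrt{3 \mu^* n (k - 1) \alpha}$, as claimed. □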

We  now  compare  Theorem  30  to  previous  regret  bounds  for  the  classical  k- armed 
bandit  problem. 

Types  of  Regret  Bounds 

In comparing the regret bound of Theorem 30 to previous work, we must distinguish between
two different types of regret bounds. The first type of bound describes the asymptotic
behavior of regret (as $n \to \infty$) on a fixed problem instance (i.e., with all $k$ payoff
distributions held constant). In this framework, a lower bound of $\Omega(\ln(n))$ has been proved,
and algorithms exist that achieve regret $O(\ln(n))$ [4]. Though we do not prove it here,
Chernoff Interval Estimation also achieves $O(\ln(n))$ regret in this framework when $\delta$ is
set appropriately.

The second type of bound concerns the maximum, over all possible instances, of the
regret incurred by the algorithm when run on that instance for $n$ pulls. In this setting, a
lower bound of $\Omega(\sqrt{kn})$ has been proved [5]. It is this second form of bound that Theorem
30 provides. In what follows, we will only consider bounds of this second form.


The Classical k-Armed Bandit Problem

We are not aware of any work on the classical k-armed bandit problem that offers a better
regret bound (of the second form) than the one proved in Theorem 30. Auer et al.
[4] analyze an algorithm that is identical to ours except that the confidence intervals are
derived from Hoeffding's inequality rather than Chernoff's inequality. An analysis analogous
to the one given in this chapter shows that their algorithm has worst-case regret
$O(\sqrt{nk \ln(n)})$ when the instance is chosen adversarially as a function of $n$. Plugging
$\delta = \frac{1}{n^2}$ into Theorem 30 gives a bound of $O(\sqrt{n \mu^* k \ln(n)})$, which is never any worse
than the latter bound (because $\mu^* \le 1$) and is much better when $\mu^*$ is small.
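To make the dependence on $\mu^*$ concrete, the following sketch evaluates the bound of Theorem 30, taken here to be $2\sqrt{3\mu^* n(k-1)\alpha} + \delta\mu^* n$ with $\alpha = \ln(2nk/\delta)$, where the second term is the $\delta\mu^* n$ contribution identified in the proof. The function name and the particular values of $n$, $k$, and $\delta$ are illustrative assumptions, not values used in the thesis.

```python
import math


def chernoff_regret_bound(n, k, mu_star, delta):
    """Bound of Theorem 30: 2*sqrt(3*mu_star*n*(k-1)*alpha) + delta*mu_star*n."""
    alpha = math.log(2 * n * k / delta)
    return 2 * math.sqrt(3 * mu_star * n * (k - 1) * alpha) + delta * mu_star * n


if __name__ == "__main__":
    n, k = 10 ** 7, 5            # hypothetical problem size
    delta = 1 / n ** 2           # a small failure probability
    for mu_star in (1.0, 0.1, 0.01):
        bound = chernoff_regret_bound(n, k, mu_star, delta)
        # mu_star * n is the trivial upper bound on regret
        print(f"mu* = {mu_star:<5} bound = {bound:12.1f}  trivial = {mu_star * n:12.1f}")
```

For fixed $n$, $k$, and $\delta$ the leading term scales as $\sqrt{\mu^*}$, so reducing $\mu^*$ by a factor of 100 reduces the bound by roughly a factor of 10. For small $n$ the bound can exceed the trivial bound $\mu^* n$, in which case it carries no information.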


The  Nonstochastic  Multiarmed  Bandit  Problem 

In a different paper, Auer et al. [5] consider a variant of the classical k-armed bandit
problem in which the sequence of payoffs returned by each arm is determined adversarially
in advance. For this more difficult problem, they present an algorithm called Exp3.1 with
expected regret

$$8\sqrt{(e - 1)\, G_{max}\, k \ln(k)} + 8(e - 1)k + 2k \ln(k)$$

where $G_{max}$ is the maximum, over all $k$ arms, of the total payoff that would be obtained
by pulling that arm for all $n$ trials. If we plug in $G_{max} = \mu^* n$, this bound is sometimes
better than the one given by Theorem 30 and sometimes not, depending on the values of
$n$, $k$, and $\mu^*$, as well as the choice of the parameter $\delta$.

5.2.2 Threshold Ascent


To solve the max k-armed bandit problem, we use Chernoff Interval Estimation to maximize
the number of payoffs that exceed a threshold $T$ that varies over time. Initially, we
set $T$ to zero. Whenever $s$ or more payoffs $> T$ have been received so far, we increment $T$.
We refer to the resulting algorithm as Threshold Ascent. To ease explanation, we assume
that all payoffs are integer multiples of some known constant $\Delta$ when presenting the code
for Threshold Ascent.


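Max k-armed bandit strategy Threshold Ascent

Input: positive integers $n$ and $s$, real number $\delta \in (0, 1)$.

1. Initialize $T \leftarrow 0$ and $n_i^R \leftarrow 0$, $\forall i \in \{1, 2, \ldots, k\}$, $R \in \{0, \Delta, 2\Delta, \ldots, 1 - \Delta, 1\}$.

2. Repeat $n$ times:

(a) While $\sum_{i=1}^{k} S_i(T) \ge s$ do:

$T \leftarrow T + \Delta$

where $S_i(t) = \sum_{R > t} n_i^R$ is the number of payoffs $> t$ received
so far from arm $i$.

(b) $\hat{\imath} \leftarrow \arg\max_i U\left(\frac{S_i(T)}{n_i}, n_i\right)$, where $n_i = \sum_R n_i^R$ is the number
of times arm $i$ has been pulled and

$$U(\mu, n_0) = \begin{cases} \mu + \dfrac{\alpha + \sqrt{2 n_0 \mu \alpha + \alpha^2}}{n_0} & \text{if } n_0 > 0 \\ \infty & \text{otherwise} \end{cases}$$

where $\alpha = \ln\left(\frac{2nk}{\delta}\right)$.

(c) Pull arm $\hat{\imath}$, receive payoff $R$, and set $n_{\hat{\imath}}^R \leftarrow n_{\hat{\imath}}^R + 1$.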
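A minimal Python sketch of Threshold Ascent follows. It is a sketch under stated assumptions rather than the code used in the experiments of §5.4: arms are modeled as zero-argument callables, the argument `step` plays the role of the constant $\Delta$, payoffs are assumed to be exact integer multiples of `step` in [0, 1], and ties in the arg max go to the lowest index.

```python
import math


def upper_bound(mu_hat, n0, alpha):
    """Chernoff-based upper confidence bound U(mu_hat, n0)."""
    if n0 == 0:
        return math.inf
    return mu_hat + (alpha + math.sqrt(2 * n0 * mu_hat * alpha + alpha ** 2)) / n0


def threshold_ascent(arms, s, n, delta, step):
    """Run Threshold Ascent for n pulls.

    Returns (best payoff seen, pulls per arm, final threshold T).
    """
    k = len(arms)
    levels = round(1 / step) + 1               # payoff values 0, step, ..., 1
    alpha = math.log(2 * n * k / delta)
    counts = [[0] * levels for _ in range(k)]  # counts[i][r]: payoffs r*step from arm i
    pulls = [0] * k
    t = 0                                      # threshold index, T = t * step
    best = 0.0

    def above(i):
        """S_i(T): number of payoffs > T received so far from arm i."""
        return sum(counts[i][t + 1:])

    def score(i):
        mu_hat = above(i) / pulls[i] if pulls[i] > 0 else 0.0
        return upper_bound(mu_hat, pulls[i], alpha)

    for _ in range(n):
        # Step (a): raise T while s or more payoffs exceed it.
        while sum(above(i) for i in range(k)) >= s:
            t += 1
        # Step (b): pick the arm with the largest upper bound.
        chosen = max(range(k), key=score)
        # Step (c): pull it and record the payoff.
        payoff = arms[chosen]()
        counts[chosen][round(payoff / step)] += 1
        pulls[chosen] += 1
        best = max(best, payoff)
    return best, pulls, t * step
```

The while loop always terminates because no recorded payoff exceeds 1, so the count of payoffs above the threshold drops to zero once $T$ reaches 1.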

The parameter $s$ controls the tradeoff between exploration and exploitation. To understand
this tradeoff, it is helpful to consider two extreme cases.

Case $s = 1$. Threshold Ascent($1$, $n$, $\delta$) is equivalent to round-robin sampling. When
$s = 1$, the threshold $T$ is incremented whenever a payoff $> T$ is obtained. Thus the value
$\frac{S_i(T)}{n_i}$ calculated in 2(b) is always 0, so the value of $U\left(\frac{S_i(T)}{n_i}, n_i\right)$ is determined strictly by
$n_i$. Because $U$ is a decreasing function of $n_i$, the algorithm simply samples whatever arm
has been sampled the smallest number of times so far.

Case $s = \infty$. Threshold Ascent($\infty$, $n$, $\delta$) is equivalent to Chernoff Interval Estimation($n$, $\delta$)
running on a k-armed bandit instance where payoffs $> T$ are mapped to 1 and payoffs
$\le T$ are mapped to 0.


5.3  A  No-Regret  Algorithm  for  GEV  Payoff  Distributions 

In this section we consider a restricted version of the max k-armed bandit problem in
which each arm yields payoff drawn from a generalized extreme value (GEV) distribution
(defined in §5.3.1). This section presents the first provably asymptotically optimal
algorithm for this problem.

Roughly speaking, the reason for assuming a GEV distribution is the Extremal Types
Theorem (stated in §5.3.1), which states that the distribution of the sample maximum of
$n$ independent identically distributed random variables approaches a GEV distribution as
$n \to \infty$. In fact, there are two arguments for assuming that each arm is a GEV distribution.
First, in practice the distribution of payoffs returned by a strong heuristic may
be approximately GEV, even if the conditions of the Extremal Types Theorem are not
formally satisfied [19].

A second argument runs as follows. Suppose $I = (G_1, G_2, \ldots, G_k)$ is an instance
of the max k-armed bandit problem in which each distribution $G_i$ satisfies the conditions
required by the Extremal Types Theorem. Consider the instance $\bar{I} = (\bar{G}_1, \bar{G}_2, \ldots, \bar{G}_k)$,
where $\bar{G}_i$ returns the maximum of $m$ samples from $G_i$. Effectively, $\bar{I}$ is a restricted version
of $I$ in which the arms must be pulled in batches of size $m$, rather than in any arbitrary
order. For $m$ sufficiently large, the Extremal Types Theorem guarantees that for each $i$,
$\bar{G}_i$ is approximately equal to a GEV, call it $G'_i$. Thus, the instance $I' = (G'_1, G'_2, \ldots, G'_k)$
is approximately equivalent to the original instance $I$, and satisfies our distributional
assumptions.

The form of our algorithm is very simple. Initially, the algorithm pulls each arm a fixed
number of times. Based on the observed payoffs, the algorithm then estimates, for each
arm, the (expected) maximum payoff that would be obtained by pulling that arm for all
remaining trials. The arm with the highest estimate is then used for all remaining trials.
An algorithm of this form has previously been analyzed for the classical k-armed bandit
problem [28]. As it turns out, the analysis in the case of the max k-armed bandit problem
is considerably more technical.

For reasons that will become clear, the nature of our results depends on the shape
parameter ($\xi$) of the GEV distributions. Assuming all arms have $\xi \le 0$, we obtain a
strategy whose regret is $o(1)$. In the exotic case where one or more arms have $\xi > 0$, the
expected maximum payoff obtained by pulling the best arm $n$ times grows without bound,
and grows too fast for us to be able to obtain additive regret that is $o(1)$. In this case, we
obtain expected maximum payoff within a factor $1 - o(1)$ of that of the best arm.

We should note up front that the results presented in this section are primarily of theoretical
interest, and are quite a bit more technical than the results presented in the previous
section. In our experimental evaluation, we found that Threshold Ascent performed much
better than the no-regret strategy for GEV payoff distributions described in this section.

5.3.1  Background:  extreme  value  theory 

This  section  provides  a  self-contained  overview  of  results  in  extreme  value  theory  that  are 
relevant  to  this  work.  Our  presentation  is  based  on  the  text  by  Coles  [23]. 

The central result of extreme value theory is an analogue of the Central Limit Theorem
that applies to extremely rare events. Recall that the Central Limit Theorem states that
(under certain regularity conditions) the distribution of the sum of $n$ independent, identically
distributed (i.i.d.) random variables converges to a normal distribution as $n \to \infty$.
The Extremal Types Theorem states that (under certain regularity conditions) the distribution
of the maximum of $n$ i.i.d. random variables converges to a generalized extreme value
(GEV) distribution.

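Definition (GEV distribution). A random variable $Z$ has a generalized extreme value
distribution if, for constants $\mu$, $\sigma > 0$, and $\xi$, $P[Z \le z] = GEV_{(\mu,\sigma,\xi)}(z)$, where

$$GEV_{(\mu,\sigma,\xi)}(z) = \exp\left(-\left(1 + \xi\,\frac{z - \mu}{\sigma}\right)^{-1/\xi}\right)$$

for $z$ such that $1 + \xi(z - \mu)\sigma^{-1} > 0$. For all other $z$, $GEV_{(\mu,\sigma,\xi)}(z) = 1$ if $\xi < 0$ (such $z$ lie
above the support) and $GEV_{(\mu,\sigma,\xi)}(z) = 0$ if $\xi > 0$ (such $z$ lie below the support). The case
$\xi = 0$ is interpreted as the limit

$$\lim_{\xi' \to 0} GEV_{(\mu,\sigma,\xi')}(z) = \exp\left(-\exp\left(\frac{\mu - z}{\sigma}\right)\right).$$

The following three propositions establish properties of the GEV distribution.

Proposition 1. Let $Z$ be a random variable with $P[Z \le z] = GEV_{(\mu,\sigma,\xi)}(z)$. Then

$$E[Z] = \begin{cases} \mu + \frac{\sigma}{\xi}\left(\Gamma(1 - \xi) - 1\right) & \text{if } \xi < 1,\ \xi \ne 0 \\ \mu + \sigma\gamma & \text{if } \xi = 0 \\ \infty & \text{if } \xi \ge 1 \end{cases}$$

where

$$\Gamma(z) = \int_0^\infty t^{z-1} \exp(-t)\, dt$$

is the complete gamma function and

$$\gamma = \lim_{n \to \infty} \left(\sum_{j=1}^{n} \frac{1}{j} - \ln(n)\right)$$

is Euler's constant.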

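We now introduce some additional notation. Let $G = GEV_{(\mu,\sigma,\xi)}$ be a GEV distribution,
and let the random variable $M_n$ equal the maximum of $n$ independent samples from
$G$.

Proposition 2. $M_n$ has distribution $GEV_{(\mu',\sigma',\xi')}$ where

$$\mu' = \begin{cases} \mu + \frac{\sigma}{\xi}\left(n^{\xi} - 1\right) & \text{if } \xi \ne 0 \\ \mu + \sigma \ln(n) & \text{otherwise,} \end{cases}$$

$\sigma' = \sigma n^{\xi}$, and $\xi' = \xi$.

Substituting the parameters of $M_n$ given by Proposition 2 into Proposition 1 gives an
expression for $E[M_n]$.

Proposition 3. Let $G = GEV_{(\mu,\sigma,\xi)}$ where $\xi < 1$. Then

$$E[M_n] = \begin{cases} \mu + \frac{\sigma}{\xi}\left(n^{\xi}\, \Gamma(1 - \xi) - 1\right) & \text{if } \xi \ne 0 \\ \mu + \sigma\gamma + \sigma \ln(n) & \text{otherwise.} \end{cases}$$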

It follows that

•  for $\xi > 0$, $E[M_n]$ is $\Theta(n^{\xi})$;

•  for $\xi = 0$, $E[M_n]$ is $O(\ln(n))$; and

•  for $\xi < 0$, $E[M_n] = \mu - \frac{\sigma}{\xi} - O(n^{\xi})$.

Figure 5.2: The effect of the shape parameter ($\xi$) on the expected maximum of $n$ independent
draws from a GEV distribution.

In the analysis that follows in later sections, it will be useful to have a visual picture
of what Proposition 3 means. Figure 5.2 plots $E[M_n]$ as a function of $\ln n$ for three GEV
distributions with $\mu = 0$, $\sigma = 1$, and $\xi \in \{0.1, 0, -0.1\}$. When the shape parameter $\xi$ is
negative, the expected maximum payoff approaches an asymptote as $n \to \infty$; when $\xi = 0$,
the expected maximum payoff grows linearly as a function of $\ln n$; and when $\xi > 0$, the
expected maximum payoff grows super-linearly as a function of $\ln n$.


The central result of extreme value theory is the following theorem.

The Extremal Types Theorem. Let $G$ be an arbitrary cumulative distribution function,
and suppose there exist sequences of constants $\{a_n > 0\}$ and $\{b_n\}$ such that

$$\lim_{n \to \infty} P\left[\frac{M_n - b_n}{a_n} \le z\right] = G^*(z) \tag{5.4}$$

for any continuity point $z$ of $G^*$, where $G^*$ is not a point mass. Then there exist constants
$\mu$, $\sigma > 0$, and $\xi$ such that $G^*(z) = GEV_{(\mu,\sigma,\xi)}(z)$ $\forall z$. Furthermore,

$$\lim_{n \to \infty} P[M_n \le z] = GEV_{(\mu a_n + b_n,\, \sigma a_n,\, \xi)}(z).$$

Condition (5.4) holds for a variety of distributions including the normal, lognormal,
uniform, and Cauchy distributions.

5.3.2  A  no-regret  algorithm 

In this section we will analyze the max k-armed bandit strategy $S_{GEV}$ shown below. Here,
and throughout this section, we use $M_n^i$ to denote the maximum payoff obtained by pulling
the $i$th arm $n$ times, and we define

$$m_n^i = E\left[M_n^i\right].$$


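Max k-armed bandit strategy $S_{GEV}$
Input: real numbers $\epsilon > 0$, $\delta \in (0, 1)$

1. (Exploration) For each arm $G_i$:

(a) Using $t = O\left(\ln\left(\frac{1}{\delta}\right) \frac{\ln(n)^2}{\epsilon^2}\right)$ samples of $G_i$, obtain an estimate $\hat{m}_n^i$
of $m_n^i$. Assuming that arm $G_i$ has shape parameter $\xi_i \le 0$, our
estimate will have the property that

$$P\left[\left|\hat{m}_n^i - m_n^i\right| < \epsilon\right] \ge 1 - \delta.$$

2. (Exploitation) Set $\hat{\imath} = \arg\max_{1 \le i \le k} \hat{m}_n^i$, and pull arm $G_{\hat{\imath}}$ for the
remaining $n - tk$ trials.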
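The two phases of the strategy can be sketched as a generic explore-then-commit loop. This is only a skeleton under stated assumptions: arms are zero-argument callables, and `estimate_expected_max` is a hypothetical user-supplied function standing in for the estimation procedure of step 1(a), which is described later in this section. The skeleton itself makes no claim about the accuracy of whatever estimator is plugged in.

```python
import math


def s_gev(arms, n, t, estimate_expected_max):
    """Explore-then-commit skeleton in the style of strategy S_GEV.

    arms: zero-argument callables returning payoffs.
    t: number of exploration samples per arm; 1 <= t and t * len(arms) <= n.
    estimate_expected_max(samples, n): hypothetical estimator of the expected
        maximum of n draws from an arm, given its exploration samples.
    Returns (index of the committed arm, maximum payoff observed).
    """
    k = len(arms)
    if t < 1 or t * k > n:
        raise ValueError("need 1 <= t and t * k <= n")
    best = -math.inf
    estimates = []
    for arm in arms:                      # step 1: exploration
        samples = [arm() for _ in range(t)]
        best = max(best, max(samples))
        estimates.append(estimate_expected_max(samples, n))
    chosen = max(range(k), key=lambda i: estimates[i])
    for _ in range(n - t * k):            # step 2: exploitation
        best = max(best, arms[chosen]())
    return chosen, best
```

For instance, passing `lambda samples, n: max(samples)` gives a crude stand-in estimator that is useful for testing the control flow; it is not the estimator analyzed in this section.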


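If an arm $G_i$ has shape parameter $\xi_i > 0$, the estimate obtained in step 1(a) will instead
have the property that

$$P\left[\frac{1}{1 + \epsilon} < \frac{\hat{m}_n^i - \alpha_1}{m_n^i - \alpha_1} < 1 + \epsilon\right] \ge 1 - \delta$$

for a constant $\alpha_1$ independent of $n$.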


Assumptions 

We require that each arm $G_i$ have finite, bounded mean and variance. To ensure this, it suffices
to assume that each arm $G_i = GEV_{(\mu_i,\sigma_i,\xi_i)}$ is a GEV distribution whose parameters
satisfy

1. $|\mu_i| \le \mu_u$

2. $0 < \sigma_\ell \le \sigma_i \le \sigma_u$
3. $\xi_\ell \le \xi_i \le \xi_u < \frac{1}{2}$

for known constants $\mu_u$, $\sigma_\ell$, $\sigma_u$, and $\xi_u$.


Analysis 


The following theorem shows that with appropriate settings of $\epsilon$ and $\delta$, strategy $S_{GEV}$ is
asymptotically optimal when each arm has shape parameter $\xi_i \le 0$. In Appendix A, we
establish a similar guarantee (using the same parameter settings) when one or more arms
have $\xi_i > 0$.


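Theorem 31. Let $I = (G_1, G_2, \ldots, G_k)$ be an instance of the max k-armed bandit problem,
where $G_i = GEV_{(\mu_i,\sigma_i,\xi_i)}$ and $\xi_i \le 0$ for all $i$. Then strategy $S_{GEV}$, run on instance
$I$ with parameters $\epsilon = \sqrt[3]{\frac{k}{n}}$ and $\delta = \frac{1}{kn^2}$, has regret $O\left(\ln(nk) \ln(n)^2 \sqrt[3]{\frac{k}{n}}\right)$.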


Proof (sketch). There are three potential sources of regret. We will show that the contribution
from each source is $O(\Delta)$, where $\Delta = \ln(nk) \ln(n)^2 \sqrt[3]{\frac{k}{n}}$.

First, with probability at most $k\delta$, one of the estimates obtained during the
exploration phase will be more than $\epsilon$ away from its true value. However, if all arms
have $\xi_i \le 0$ then by Proposition 3, the maximum regret is $O(\ln n)$. Thus, this possibility
contributes $O(k\delta \ln n) = o(\Delta)$ to regret.

The second source of regret is that, even if all estimates are within $\epsilon$ of their true
values, the expected maximum payoff from $n$ pulls of the arm $\hat{\imath}$ selected at the end of the
exploration phase could be up to $2\epsilon$ smaller than that of some other arm. This possibility
contributes at most $2\epsilon$ to regret, and $\epsilon = O(\Delta)$.

The final source of regret is that, due to the time spent on the exploration phase, the
presumed best arm $\hat{\imath}$ will only be pulled $n - t(k - 1)$ times, rather than $n$ times. To
complete the proof, we show in Appendix A that for any arm $i$,

$$m_n^i - m_{n - t(k-1)}^i = O\left(\frac{tk}{n}\right) = O(\Delta).$$

□


Theorem 31 completes our analysis of the performance of $S_{GEV}$. It remains only to
describe how the estimates in step 1(a) are obtained.

Obtaining the required estimates

We now describe how to obtain accurate estimates of the expected maximum of $n$ independent
draws from a GEV distribution. Although the estimation procedure itself is not
complicated, the proofs that the estimates have the required properties are quite technical.
In this section, we provide only sketches of these proofs, deferring the full proofs to
Appendix A.

We adopt the following notation:

•  Let $G = GEV_{(\mu,\sigma,\xi)}$ denote a GEV distribution with (unknown) parameters $\mu$, $\sigma$,
and $\xi$ satisfying the conditions stated in §5.3.1, and

•  let $m_j$ be the expected maximum of $j$ samples from $G$.

Our procedure for estimating $m_n$ is as follows. First, we obtain an accurate estimate
of $\xi$. Then

1. if $\xi \approx 0$ (so that the growth of $m_n$ as a function of $\ln n$ is linear), we estimate $m_n$ by
first estimating $m_1$ and $m_2$, then performing a linear interpolation;

2. otherwise we estimate $m_n$ by first estimating $m_1$, $m_2$, and $m_4$, then performing a
nonlinear interpolation.

Our estimation procedure will take a different form depending on whether the GEV distribution
is estimated to have shape parameter $\xi < 0$, $\xi = 0$, or $\xi > 0$. Although we will analyze
all three cases, it is worth noting that the case $\xi < 0$ is really the only one that can
arise in practice. This is true because in any real combinatorial optimization problem the
maximum payoff is bounded from above, which (by Proposition 3) can only happen when
$\xi < 0$.

For the purpose of the proofs presented in this section, we will make an additional minor
assumption concerning an arm's shape parameter $\xi_i$: we assume that for some known
constant $\xi^* > 0$,

$$|\xi_i| < \xi^* \;\Rightarrow\; \xi_i = 0.$$

Removing this assumption does not fundamentally change the results, but it makes the
proofs more complicated.

We will repeatedly make use of the following lemma, which shows how to accurately
estimate $m_j$ when $j$ is small (the required number of samples grows linearly with $j$). The
proof is given in Appendix A, and uses a standard probabilistic trick called the "median of
means" method.

Lemma 15. Let $j$ be a positive integer and let $\epsilon > 0$ and $\delta \in (0, 1)$ be real numbers. Then
$O\left(\ln\left(\frac{1}{\delta}\right) \frac{j}{\epsilon^2}\right)$ draws from $G$ suffice to obtain an estimate $\hat{m}_j$ of $m_j$ such that

$$P\left[|\hat{m}_j - m_j| < \epsilon\right] \ge 1 - \delta.$$
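The following sketch shows one common way to organize a median-of-means estimate of $m_j$. The precise construction and constants used in the proof are given in Appendix A and are not reproduced here; the function name and the parameters `blocks_per_group` and `groups` are assumptions of this sketch. Roughly, `blocks_per_group` would be chosen on the order of $1/\epsilon^2$ and `groups` on the order of $\ln(1/\delta)$, so that the total number of draws grows linearly with $j$.

```python
import math
import random
import statistics


def estimate_expected_max(draw, j, blocks_per_group, groups):
    """Median-of-means estimate of m_j, the expected maximum of j draws.

    draw: zero-argument callable returning one sample from G.
    Uses j * blocks_per_group * groups draws in total.
    """
    group_means = []
    for _ in range(groups):
        # Each block yields one observation of the maximum of j samples.
        block_maxima = [max(draw() for _ in range(j)) for _ in range(blocks_per_group)]
        group_means.append(statistics.mean(block_maxima))
    # The median of the group means is robust to a few bad groups.
    return statistics.median(group_means)


if __name__ == "__main__":
    rng = random.Random(0)
    # Standard Gumbel samples (GEV with mu = 0, sigma = 1, xi = 0), for illustration.
    gumbel = lambda: -math.log(rng.expovariate(1.0))
    print(estimate_expected_max(gumbel, j=2, blocks_per_group=400, groups=9))
```

For the standard Gumbel distribution used in the demo, Proposition 3 gives $m_2 = \gamma + \ln 2$, which offers a reference value for the printed estimate.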


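Our first lemma proves a bound on the number of samples needed to accurately estimate
$\xi$.

Lemma 16. For real numbers $\epsilon > 0$ and $\delta \in (0, 1)$, $O\left(\ln\left(\frac{1}{\delta}\right) \frac{1}{\epsilon^2}\right)$ draws from $G$ suffice to
obtain an estimate $\hat{\xi}$ of $\xi$ such that

$$P\left[|\hat{\xi} - \xi| < \epsilon\right] \ge 1 - \delta.$$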


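Proof (sketch). Using Proposition 3, it is straightforward to check that for any $\xi < 1$,

$$\xi = \log_2\left(\frac{m_4 - m_2}{m_2 - m_1}\right). \tag{5.5}$$

Let $\hat{m}_1$, $\hat{m}_2$, and $\hat{m}_4$ be estimates of $m_1$, $m_2$, and $m_4$, respectively, and let $\hat{\xi}$ be the
estimate of $\xi$ obtained by plugging $\hat{m}_1$, $\hat{m}_2$, and $\hat{m}_4$ into (5.5).

It can be shown (see the full proof in Appendix A) that

$$|\hat{\xi} - \xi| = O\left(\sum_{j \in \{1, 2, 4\}} |\hat{m}_j - m_j|\right).$$

Thus to guarantee $P[|\hat{\xi} - \xi| < \epsilon] \ge 1 - \delta$, it suffices that

$$P\left[|\hat{m}_j - m_j| < \Omega(\epsilon)\right] \ge 1 - \frac{\delta}{3}$$

for all $j \in \{1, 2, 4\}$. By Lemma 15, this requires $O\left(\ln\left(\frac{1}{\delta}\right) \frac{1}{\epsilon^2}\right)$ draws from $G$. □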
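Equation (5.5) can be checked numerically by feeding it the exact values of $m_1$, $m_2$, and $m_4$ given by Proposition 3. The sketch below does this; the function names are chosen here for illustration, and in the actual procedure the inputs would be the estimates $\hat{m}_1$, $\hat{m}_2$, $\hat{m}_4$ rather than exact values.

```python
import math

EULER_GAMMA = 0.5772156649015329


def expected_max(mu, sigma, xi, j):
    """m_j = E[M_j] from Proposition 3 (requires xi < 1)."""
    if xi == 0:
        return mu + sigma * EULER_GAMMA + sigma * math.log(j)
    return mu + sigma / xi * (j ** xi * math.gamma(1 - xi) - 1)


def estimate_shape(m1, m2, m4):
    """Equation (5.5): xi = log2((m4 - m2) / (m2 - m1)).

    With noisy inputs the ratio may be non-positive or the denominator zero,
    in which case this raises ValueError or ZeroDivisionError.
    """
    return math.log2((m4 - m2) / (m2 - m1))


if __name__ == "__main__":
    for xi in (-0.3, 0.0, 0.2):
        m1, m2, m4 = (expected_max(1.0, 2.0, xi, j) for j in (1, 2, 4))
        print(xi, estimate_shape(m1, m2, m4))
```

The identity holds because $m_4 - m_2$ and $m_2 - m_1$ differ exactly by the factor $2^{\xi}$ when $\xi \ne 0$, and are equal when $\xi = 0$.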

Having just shown how to accurately estimate $\xi$, it remains to describe how to estimate
$m_n$ given knowledge of $\xi$. Lemmas 17, 18 and 19 show how to efficiently estimate $m_n$
in the cases $\xi = 0$, $\xi < 0$ and $\xi > 0$, respectively. The case $\xi = 0$ is by far the most
straightforward.

Lemma 17. Assume $G$ has shape parameter $\xi = 0$. Let $n$ be a positive integer and let
$\epsilon > 0$ and $\delta \in (0, 1)$ be real numbers. Then $O\left(\ln\left(\frac{1}{\delta}\right) \frac{\ln(n)^2}{\epsilon^2}\right)$ draws from $G$ suffice to
obtain an estimate $\hat{m}_n$ of $m_n$ such that

$$P\left[|\hat{m}_n - m_n| < \epsilon\right] \ge 1 - \delta.$$

Proof. By Proposition 3, $m_j = \mu + \sigma\gamma + \sigma \ln(j)$. Thus

$$m_n = m_1 + (m_2 - m_1) \log_2(n). \tag{5.6}$$

Let $\hat{m}_1$ and $\hat{m}_2$ be estimates of $m_1$ and $m_2$, respectively, and let $\hat{m}_n$ be the estimate of $m_n$
obtained by plugging $\hat{m}_1$ and $\hat{m}_2$ into (5.6). Define $\Delta_j = |\hat{m}_j - m_j|$ for $j \in \{1, 2, n\}$.
Then

$$\Delta_n \le (1 + \log_2(n))(\Delta_1 + \Delta_2).$$

Thus to guarantee $P[\Delta_n < \epsilon] \ge 1 - \delta$, it suffices that $P\left[\Delta_j < \frac{\epsilon}{2(1 + \log_2(n))}\right] \ge 1 - \frac{\delta}{2}$ for
all $j \in \{1, 2\}$. By Lemma 15, this requires $O\left(\ln\left(\frac{1}{\delta}\right) \frac{(\ln n)^2}{\epsilon^2}\right)$ draws from $G$.

□


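Lemma 18. Assume $G$ has shape parameter $\xi \le -\xi^*$. Let $n$ be a positive integer and let
$\epsilon > 0$ and $\delta \in (0, 1)$ be real numbers. Then $O\left(\ln\left(\frac{1}{\delta}\right) \frac{1}{\epsilon^2}\right)$ draws from $G$ suffice to obtain an
estimate $\hat{m}_n$ of $m_n$ such that

$$P\left[|\hat{m}_n - m_n| < \epsilon\right] \ge 1 - \delta.$$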

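Proof (sketch). By Proposition 3,

$$m_j = \mu + \frac{\sigma}{\xi}\left(j^{\xi}\,\Gamma(1 - \xi) - 1\right) .$$

Define

$$\alpha_1 = \mu - \frac{\sigma}{\xi}, \qquad \alpha_2 = \frac{\sigma}{\xi}\,\Gamma(1 - \xi), \qquad \alpha_3 = 2^{\xi}$$

so that

$$m_j = \alpha_1 + \alpha_2\,\alpha_3^{\log_2(j)} . \qquad (5.7)$$

Plugging in the values j = 1, j = 2, and j = 4 into (5.7) yields a system of three quadratic
equations. Solving this system for $\alpha_1$, $\alpha_2$, and $\alpha_3$ yields

$$\alpha_1 = (m_1 m_4 - m_2^2)(m_1 - 2m_2 + m_4)^{-1}$$

$$\alpha_2 = (-2 m_1 m_2 + m_1^2 + m_2^2)(m_1 - 2m_2 + m_4)^{-1}$$

$$\alpha_3 = (m_4 - m_2)(m_2 - m_1)^{-1} .$$

Let $\hat{m}_1$, $\hat{m}_2$, and $\hat{m}_4$ be estimates of $m_1$, $m_2$, and $m_4$, respectively. Plugging $\hat{m}_1$, $\hat{m}_2$,
and $\hat{m}_4$ into the above equations yields estimates, say $\hat{\alpha}_1$, $\hat{\alpha}_2$, and $\hat{\alpha}_3$, of $\alpha_1$, $\alpha_2$, and $\alpha_3$,
respectively.

With no small amount of algebraic effort, it can be shown (see the full proof in Appendix A)
that

$$|\hat{m}_n - m_n| = O\left(\max_{j \in \{1,2,4\}} |\hat{m}_j - m_j|\right) .$$

Thus to guarantee $P[\,|\hat{m}_n - m_n| < \epsilon\,] \ge 1 - \delta$, it suffices that

$$P\left[\,|\hat{m}_j - m_j| < \Omega(\epsilon)\,\right] \ge 1 - \frac{\delta}{3}$$

for all $j \in \{1, 2, 4\}$. By Lemma 15, this requires $O\left(\ln\left(\frac{1}{\delta}\right)\frac{1}{\epsilon^2}\right)$ draws from G. □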
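The closed-form solution of the system obtained from (5.7) can be verified numerically. The sketch below assumes the formula for $m_j$ quoted from Proposition 3 and uses exact values of $m_1$, $m_2$, $m_4$ in place of estimates; the function names and parameter values are hypothetical.

```python
import math

def gev_expected_max(mu, sigma, xi, j):
    """m_j = mu + (sigma/xi) * (j**xi * Gamma(1 - xi) - 1), for xi != 0, xi < 1."""
    return mu + (sigma / xi) * (j ** xi * math.gamma(1.0 - xi) - 1.0)

def solve_alphas(m1, m2, m4):
    """Recover (alpha_1, alpha_2, alpha_3) from m_1, m_2, m_4."""
    denom = m1 - 2.0 * m2 + m4
    alpha1 = (m1 * m4 - m2 ** 2) / denom
    alpha2 = (m1 ** 2 - 2.0 * m1 * m2 + m2 ** 2) / denom
    alpha3 = (m4 - m2) / (m2 - m1)
    return alpha1, alpha2, alpha3

def extrapolate_m_n(m1, m2, m4, n):
    """Evaluate (5.7) at j = n using the recovered alphas."""
    alpha1, alpha2, alpha3 = solve_alphas(m1, m2, m4)
    return alpha1 + alpha2 * alpha3 ** math.log2(n)

# Illustrative parameters with a negative shape parameter (hypothetical).
mu, sigma, xi = 0.5, 0.1, -0.3
m1, m2, m4 = (gev_expected_max(mu, sigma, xi, j) for j in (1, 2, 4))
alpha1, alpha2, alpha3 = solve_alphas(m1, m2, m4)
m_1000 = extrapolate_m_n(m1, m2, m4, 1000)
```

In practice the inputs would be sample-based estimates, and the lemma bounds how their errors propagate to the extrapolated value.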

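Lemma 19. Assume G has shape parameter $\xi \ge \xi^*$. Let n be a positive integer and let
$\epsilon > 0$ and $\delta \in (0,1)$ be real numbers. Then $O\left(\ln\left(\frac{1}{\delta}\right)\frac{(\ln n)^2}{\epsilon^2}\right)$ draws from G suffice to
obtain an estimate $\hat{m}_n$ of $m_n$ such that

$$P\left[\frac{1}{1 + \epsilon} < \frac{\hat{m}_n - \alpha_1}{m_n - \alpha_1} < (1 + \epsilon)\right] \ge 1 - \delta$$

where $\alpha_1 = \mu - \frac{\sigma}{\xi}$.

Proof. See Appendix A. □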


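Putting the results of Lemmas 16, 17, 18, and 19 together, we obtain the following
theorem.

Theorem 32. Let n be a positive integer and let $\epsilon > 0$ and $\delta \in (0,1)$ be real numbers.
Then $O\left(\ln\left(\frac{1}{\delta}\right)\frac{(\ln n)^2}{\epsilon^2}\right)$ draws from G suffice to obtain an estimate $\hat{m}_n$ of $m_n$ such that with
probability at least $1 - \delta$, one of the following holds:

• $\xi \le 0$ and $|\hat{m}_n - m_n| < \epsilon$, or

• $\xi > 0$ and $\frac{1}{1 + \epsilon} < \frac{\hat{m}_n - \alpha_1}{m_n - \alpha_1} < 1 + \epsilon$, where $\alpha_1 = \mu - \frac{\sigma}{\xi}$.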

Proof.  First,  invoke  Lemma  16  with  parameters  y  and  Then  invoke  one  of  Lemmas 
17, 18, or 19 (depending on the estimate $\hat{\xi}$ obtained from Lemma 16) with parameters $\epsilon$
and $\frac{\delta}{2}$. □

Theorem 32 shows that step 1 (a) of strategy SGEV can be performed as described,
completing our analysis of SGEV.
5.4 Experimental Evaluation


Following Cicirello and Smith [19, 21], we evaluate our algorithm for the max k-armed
bandit problem by using it to select among randomized priority dispatching rules for the
resource-constrained project scheduling problem with maximal time lags (RCPSP/max).
Cicirello and Smith's work showed that a max k-armed bandit approach yields good
performance on benchmark instances of this problem.

Briefly, in the RCPSP/max one must assign start times to each of a number of activities
in such a way that certain temporal and resource constraints are satisfied. Such
an  assignment  of  start  times  is  called  a  feasible  schedule.  The  goal  is  to  find  a  feasible 
schedule  whose  makespan  is  as  small  as  possible,  where  the  makespan  of  a  schedule  is  the 
maximum  completion  time  of  any  activity. 

Even without maximal time lags (which make the problem more difficult), the
resource-constrained project scheduling problem is NP-hard and is “one of the most intractable
problems in operations research” [63]. When maximal time lags are included, even the
feasibility problem (i.e., deciding whether a feasible schedule exists) is NP-hard.

Our  experimental  evaluation  focuses  on  Threshold  Ascent.  In  these  experiments,  we 
found  that  the  number  of  trials  is  small  enough  that  SGEV  never  makes  it  past  the  initial 
exploration  phase,  and  thus  performs  similarly  to  round-robin  sampling. 

5.4.1  Experimental  setup 

In this section we define the RCPSP/max and discuss the heuristics and benchmark
instances used in our experiments.


The  RCPSP/max 

Formally, an instance of the RCPSP/max is a tuple $\mathcal{I} = (A, R, \mathcal{T})$, where A is a set of
activities, R is a vector of resource capacities, and $\mathcal{T}$ is a list of temporal constraints.
Each activity $a_i \in A$ has a processing time $p_i$, and a resource demand $r_{i,k}$ for each
$k \in \{1, 2, \ldots, |R|\}$. Each temporal constraint $T \in \mathcal{T}$ is a triple $T = (i, j, \delta)$, where i and j are
activity indices and $\delta$ is an integer. The constraint $T = (i, j, \delta)$ indicates that activity $a_j$
cannot start until $\delta$ time units after activity $a_i$ has started.

A schedule S assigns a start time $S(a)$ to each activity $a \in A$. S is feasible if

$$S(a_j) - S(a_i) \ge \delta \quad \forall (i, j, \delta) \in \mathcal{T}$$

(i.e., all temporal constraints are satisfied), and

$$\sum_{a_i \in A(S,t)} r_{i,k} \le R_k \quad \forall t \ge 0,\ k \in \{1, 2, \ldots, |R|\}$$

where $A(S, t) = \{a_i \in A \mid S(a_i) \le t < S(a_i) + p_i\}$ is the set of activities that are in progress
at time t. The latter inequality ensures that no resource capacity is ever exceeded.
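The two feasibility conditions translate directly into a checker. The following is a minimal sketch under assumed data structures (lists indexed by activity, with non-negative start times); the function name and argument layout are illustrative and do not come from the thesis.

```python
def is_feasible(start, proc, demand, capacity, constraints):
    """Check the two RCPSP/max feasibility conditions.

    start[i]      -- start time S(a_i), assumed non-negative
    proc[i]       -- processing time p_i
    demand[i][k]  -- resource demand r_{i,k}
    capacity[k]   -- resource capacity R_k
    constraints   -- iterable of (i, j, delta) triples
    """
    # Temporal constraints: S(a_j) - S(a_i) >= delta.
    for i, j, delta in constraints:
        if start[j] - start[i] < delta:
            return False

    # Resource usage is piecewise constant and can only increase at a
    # start time, so it suffices to check usage at each start time.
    for t in set(start):
        active = [i for i in range(len(start))
                  if start[i] <= t < start[i] + proc[i]]
        for k, cap in enumerate(capacity):
            if sum(demand[i][k] for i in active) > cap:
                return False
    return True
```

For example, two activities of length 2 that each need the single unit of one resource are feasible when run back to back (`start = [0, 2]`) and infeasible when they overlap (`start = [0, 1]`). A maximal time lag can be encoded with a negative $\delta$ on the reversed pair of activities.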


Randomized  priority  dispatching  rules 

A  priority  dispatching  rule  for  the  RCPSP/max  is  a  procedure  that  assigns  start  times  to 
activities  one  at  a  time,  in  a  greedy  fashion.  The  order  in  which  start  times  are  assigned 
is determined by a rule that assigns priorities to each activity. As noted above, it is
NP-hard to generate a feasible schedule for the RCPSP/max, and a simple priority rule will
often  fail  to  find  a  feasible  schedule  in  practice.  Priority  dispatching  rules  are  therefore 
augmented  to  perform  a  limited  amount  of  backtracking  in  order  to  increase  the  odds  of 
producing  a  feasible  schedule.  For  more  details,  see  [66]. 

Cicirello  and  Smith  describe  experiments  with  randomized  priority  dispatching  rules, 
in  which  the  next  activity  to  schedule  is  chosen  from  a  probability  distribution,  with  the 
probability  assigned  to  an  activity  being  proportional  to  its  priority.  Cicirello  and  Smith 
consider  the  five  randomized  priority  dispatching  rules  in  the  set 

H  =  {LPF,  LST,  MST,  MTS,  RSM}  . 

See  Cicirello  and  Smith  [19,  21]  for  a  description  of  these  heuristics.  We  use  the  same 
five  heuristics  as  Cicirello  and  Smith,  with  two  modifications.  First,  we  added  a  form 
of  conflict-driven  backtracking  to  the  procedure  of  [66]  in  order  to  increase  the  odds  of 
generating  a  feasible  schedule.  We  found  that  this  modification  improved  performance  in 
practice.  Second,  we  modified  the  RSM  heuristic  to  improve  its  performance. 
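The randomized selection step described above (choose the next activity with probability proportional to its priority) can be sketched as follows. This is only an illustration of the sampling rule; the priority values, the function name, and the use of Python's `random` module are assumptions, and the backtracking machinery of the actual heuristics is not modeled.

```python
import random

def pick_next_activity(priorities, rng):
    """Choose one activity with probability proportional to its priority.

    priorities -- dict mapping activity name to a non-negative priority
    rng        -- a random.Random instance (passed in for reproducibility)
    """
    activities = list(priorities)
    weights = [priorities[a] for a in activities]
    return rng.choices(activities, weights=weights, k=1)[0]

# Hypothetical priorities: "a" should be picked about 75% of the time,
# "b" about 25% of the time, and "c" never.
rng = random.Random(0)
priorities = {"a": 3.0, "b": 1.0, "c": 0.0}
counts = {name: 0 for name in priorities}
for _ in range(10_000):
    counts[pick_next_activity(priorities, rng)] += 1
```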

Instances 

We evaluate our approach on a set of 169 RCPSP/max instances from the ProGen/max
library [74]. These instances were selected as follows. We first ran the heuristic LPF (the
heuristic identified by Cicirello and Smith as having the best performance) 10,000 times
on all 540 instances from the TESTSETC data set of the ProGen/max library. For many
of these instances, LPF found a (provably) optimal schedule on a large proportion of the
runs. We considered any instance in which the best makespan found by LPF was found
with frequency > 0.01 to be “easy” and discarded it from the data set. What remained was
a set of 169 “hard” RCPSP/max instances.

For each “hard” RCPSP/max instance, we ran each heuristic $h \in H$ 10,000 times,
storing the results in a file. Using this data, we created a set $\mathcal{K}$ of 169 five-armed bandit
problems (each of the five heuristics $h \in H$ represents an arm). After the data were
collected, makespans were converted to payoffs by multiplying each makespan by $-1$ and
scaling them to lie in the interval [0, 1].
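The instance filter and the makespan-to-payoff conversion can be sketched as below. The text does not specify the exact scaling, so the affine map used here (best makespan maps to 1, worst to 0, within one instance) is an assumption, as are the function names and the restriction to feasible runs.

```python
def is_easy(makespans, threshold=0.01):
    """An instance is "easy" if the best makespan occurs with
    frequency greater than the threshold among the given runs."""
    best = min(makespans)
    return makespans.count(best) / len(makespans) > threshold

def to_payoffs(makespans):
    """Negate makespans and rescale affinely to [0, 1] (assumed scaling:
    the smallest makespan gets payoff 1, the largest gets payoff 0)."""
    lo, hi = min(makespans), max(makespans)
    if hi == lo:
        return [1.0 for _ in makespans]
    return [(hi - m) / (hi - lo) for m in makespans]
```

For instance, `to_payoffs([10, 12, 14])` gives `[1.0, 0.5, 0.0]`, so a smaller makespan always corresponds to a larger payoff.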


5.4.2  Payoff  distributions  in  the  RCPSP/max 

To better understand the potential advantages and disadvantages of approximating
payoff distributions by GEV distributions, we examined the payoff distributions generated by
randomized priority dispatching rules for the RCPSP/max. For a number of instances,
we plotted the payoff distribution functions for each heuristic $h \in H$. For each
distribution, we fitted a GEV to the empirical data using maximum likelihood estimation of the
parameters $\mu$, $\sigma$, and $\xi$, as recommended by Coles [23].

Our experience was that the GEV sometimes provides a good fit to the empirical
cumulative distribution function but sometimes provides a very poor fit. Figure 5.3 shows
the empirical distribution and the GEV fit to the payoff distribution of LPF on instances
PSP129 and PSP121. For the instance PSP129, the GEV accurately models the
entire distribution, including the right tail. For the instance PSP121, however, the GEV fit
severely overestimates the probability mass in the right tail. Indeed, the distribution in
Figure 5.3 (B) is so erratic that no parametric family of distributions can be expected to be a
good model of its behavior. In such cases a distribution-free approach is preferable.
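One way to quantify how well a fitted GEV matches an empirical distribution is the largest vertical gap between the two cumulative distribution functions (the Kolmogorov-Smirnov distance). This measure is an illustrative choice and is not taken from the thesis; the sketch below also takes the fitted parameters as given rather than performing the maximum likelihood fit.

```python
import math

def gev_cdf(x, mu, sigma, xi):
    """CDF of the GEV distribution with location mu, scale sigma, shape xi."""
    z = (x - mu) / sigma
    if xi == 0.0:
        return math.exp(-math.exp(-z))
    t = 1.0 + xi * z
    if t <= 0.0:
        # Outside the support: below the lower end point when xi > 0,
        # above the upper end point when xi < 0.
        return 0.0 if xi > 0 else 1.0
    return math.exp(-t ** (-1.0 / xi))

def ks_distance(samples, cdf):
    """Largest gap between the empirical CDF of samples and a model CDF."""
    xs = sorted(samples)
    n = len(xs)
    gap = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x.
        gap = max(gap, abs(f - i / n), abs(f - (i + 1) / n))
    return gap
```

A small distance corresponds to the situation in panel (A), and a large one to panel (B). Because the max k-armed bandit problem depends on the right tail, a tail-focused comparison may be more informative than this global measure.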


5.4.3  An  illustrative  run 

Before presenting our results, we illustrate the typical behavior of Threshold Ascent by
showing how it performs on the instance PSP124. For this and all subsequent
experiments, we run Threshold Ascent with parameters n = 10,000, s = 100, and $\delta$ = 0.01.

Figure 5.4 (A) depicts the payoff distributions for each of the five arms. As can be seen,
LPF has the best performance on PSP124. MST has zero probability of generating a
payoff > 0.8, while LST and RSM have zero probability of generating a payoff > 0.9. MTS
gives competitive performance up to a payoff of $t \approx 0.9$, at which point the probability of
obtaining a payoff > t suddenly drops to zero.
[Figure panels: (A) Instance PSP129; (B) Instance PSP121]

Figure 5.3: Empirical cumulative distribution function of the LPF heuristic for two
RCPSP/max instances. (A) depicts an instance for which the GEV provides a good fit;
(B) depicts an instance for which the GEV provides a poor fit.
[Figure panels: (A) Payoff Distributions; (B) Behavior of Threshold Ascent]

Figure 5.4: Behavior of Threshold Ascent on instance PSP124. (A) shows the payoff
distributions; (B) shows the number of pulls allocated to each arm.
Table 5.1: Performance of eight max k-armed bandit strategies on 169 RCPSP/max
instances.

Strategy                Σ Regret    P[Regret = 0]    Num. Feasible
Threshold Ascent             188            0.722              166
Round-robin sampling         345            0.556              166
LPF                          355            0.675              164
MTS                          402            0.657              166
QD-BEACON                    609            0.538              165
RSM                         2130            0.166              155
LST                         3199            0.095              164
MST                         4509            0.107              164

Figure 5.4 (B) shows the number of pulls allocated by Threshold Ascent to each of the
five  arms  as  a  function  of  the  number  of  pulls  performed  so  far.  As  can  be  seen,  Threshold 
Ascent  is  a  somewhat  conservative  strategy,  allocating  a  fair  number  of  pulls  to  heuristics 
that  might  seem  “obviously”  suboptimal  to  a  human  observer.  Nevertheless,  Threshold 
Ascent  spends  the  majority  of  its  time  sampling  the  single  best  heuristic  (LPF). 


5.4.4  Results 

For each instance $K \in \mathcal{K}$, we ran three max k-armed bandit algorithms, each with a
budget of n = 10,000 pulls: Threshold Ascent with parameters n = 10,000, s = 100, and
$\delta$ = 0.01, the QD-BEACON algorithm of Cicirello and Smith [21], and an algorithm that
simply sampled the arms in a round-robin fashion. Cicirello and Smith describe three
versions of QD-BEACON; we use the one based on the GEV distribution. For each instance
$K \in \mathcal{K}$, we define the regret of an algorithm as the difference between the minimum
makespan (which corresponds to the maximum payoff) sampled by the algorithm and the
minimum makespan sampled by any of the five heuristics (on any of the 10,000 stored
runs of each of the five heuristics). For each of the three algorithms, we also recorded
the number of instances for which the algorithm generated a feasible schedule. Table 5.1
summarizes the performance of these three algorithms, as well as the performance of each
of the five heuristics in isolation.
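The regret measure and the round-robin baseline can be sketched on stored run data as follows. How the experiments replay the stored runs is not stated here, so reading each heuristic's runs in stored order is an assumption, as are the function names and the use of `None` to mark a run that produced no feasible schedule.

```python
from itertools import cycle, islice

def round_robin_best(stored, budget):
    """Replay stored runs in round-robin order over the heuristics and
    return the smallest makespan seen (None if no run was feasible).

    stored -- dict mapping heuristic name to a list of makespans,
              with None marking an infeasible run
    budget -- total number of pulls
    """
    best = None
    position = {h: 0 for h in stored}
    for h in islice(cycle(stored), budget):
        m = stored[h][position[h] % len(stored[h])]
        position[h] += 1
        if m is not None and (best is None or m < best):
            best = m
    return best

def regret(best_found, stored):
    """Difference between the makespan found by an algorithm and the best
    makespan over all stored runs of all heuristics."""
    reference = min(m for runs in stored.values() for m in runs
                    if m is not None)
    return best_found - reference
```

With `stored = {"A": [10, 9, 12], "B": [11, 7, 8]}`, a budget of 2 pulls yields a best makespan of 10 and regret 3, while a budget of 4 pulls reaches makespan 7 and regret 0.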

Of the eight max k-armed bandit strategies we evaluated (Threshold Ascent,
QD-BEACON, round-robin sampling, and the five pure strategies), Threshold Ascent has the
least regret and achieves zero regret on the largest number of instances. Additionally,
Threshold Ascent generated a feasible schedule for the 166 (out of 169) instances for
which any of the five heuristics was able to generate a feasible schedule (for three
instances, none of the five randomized priority rules generated a feasible schedule after
10,000 runs).

5.4.5  Discussion 

Two of the findings summarized in Table 5.1 may be surprising: the fact that round-robin
sampling performs better than any single heuristic, and the fact that QD-BEACON
performs worse than round-robin. We now examine each of these findings in more detail.


Why  Round-Robin  Sampling  Performs  Well 

In the classical k-armed bandit problem, round-robin sampling can never outperform the
best pure strategy (where a pure strategy is one that samples the same arm the entire time),
either on a single instance or across multiple instances. In the max k-armed bandit
problem, however, the situation is different, as the following example illustrates.

Example 5. Suppose we have 2 heuristics, and we run them each for n trials on a set of
m instances. On half the instances, heuristic A returns payoff 0 with probability 0.9 and
returns payoff 1 with probability 0.1, while heuristic B returns payoff 0 with probability
1. On the other half of the instances, the roles of heuristics A and B are reversed.

If n is large, round-robin sampling will yield total regret $\approx 0$, while either of the two
heuristics will have regret $\approx \frac{1}{2}m$. By allocating pulls equally to each arm, round-robin
sampling is guaranteed to sample the best heuristic at least $\frac{n}{2}$ times, and if n is large this
number of samples may be enough to exploit the tail behavior of the best heuristic.
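The arithmetic behind Example 5 can be made explicit. The sketch below computes the expected total regret of each strategy in closed form, taking regret on an instance to be 1 if the strategy never observes payoff 1 and 0 otherwise; the function names are illustrative.

```python
def expected_regret_round_robin(m, n, p=0.1):
    """Round-robin pulls the good arm n/2 times on every instance;
    regret is 1 on an instance only if all of those pulls pay 0."""
    return m * (1 - p) ** (n // 2)

def expected_regret_pure(m, n, p=0.1):
    """A fixed heuristic is the good arm on half the instances (n pulls)
    and always pays 0 on the other half (regret 1 each)."""
    return (m / 2) * (1 - p) ** n + (m / 2)

# With m = 100 instances and n = 200 pulls per instance:
rr = expected_regret_round_robin(100, 200)
pure = expected_regret_pure(100, 200)
```

Here `rr` is below 0.01 while `pure` is essentially 50, matching the claim that round-robin has total regret near 0 and each pure strategy has regret near $\frac{1}{2}m$.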


Understanding  QD-BEACON 

QD-BEACON  is  designed  to  converge  to  a  single  arm  at  a  doubly-exponential  rate.  That  is, 
the  number  of  pulls  allocated  to  the  (presumed)  optimal  arm  increases  doubly-exponentially 
relative  to  the  number  of  pulls  allocated  to  presumed  suboptimal  arms.  In  our  experience, 
QD-BEACON  usually  converges  to  a  single  arm  after  at  most  10-20  pulls  from  each  arm. 
This rapid convergence can lead to large regret if the presumed best arm is actually
suboptimal.
5.5 Conclusions


The max k-armed bandit problem is a variant of the classical k-armed bandit problem with
practical applications to combinatorial optimization.

We presented an algorithm, Chernoff Interval Estimation, for solving the classical
k-armed bandit problem, and proved that it has good performance guarantees when the mean
payoff returned by each arm is small relative to the maximum possible payoff. Building
on Chernoff Interval Estimation we presented an algorithm, Threshold Ascent, that solves
the max k-armed bandit problem without making strong assumptions about the payoff
distributions. We demonstrated the effectiveness of Threshold Ascent experimentally on
the problem of selecting among randomized priority dispatching rules for the RCPSP/max.

Motivated by extreme value theory, we then studied a restricted version of this problem
in which each arm yields payoff drawn from a GEV distribution. We derived bounds on
the number of samples required to accurately estimate the expected maximum of n draws
from a GEV distribution. Using these bounds, we showed that a simple algorithm for the
max k-armed bandit problem is asymptotically optimal. Ours is the first algorithm for this
problem with rigorous asymptotic performance guarantees.
Chapter 6
Conclusions 


In  this  thesis,  we  developed  techniques  for  solving  hard  computational  problems  more 
efficiently.  Toward  this  end,  we  introduced  several  new  online  optimization  problems,  and 
developed  online  algorithms  for  solving  these  problems.  Our  online  algorithms  come  with 
rigorous  performance  guarantees,  stated  either  as  regret  bounds  or  as  a  competitive  ratio. 
Experimentally,  we  showed  that  these  techniques  can  be  used  to  improve  the  performance 
of  state-of-the-art  algorithms  in  a  wide  variety  of  problem  domains. 

Interpreted narrowly, the contributions of this thesis consist of new theoretical and
experimental results for three previously-studied problems: algorithm portfolio design, using
decision procedures efficiently for optimization, and the max k-armed bandit problem. In
each case, our results provide new ways to improve the performance of certain classes of
algorithms.

Interpreted  more  broadly,  this  thesis  presents  three  successful  examples  of  a  high-level 
strategy  for  solving  NP-hard  problems:  namely,  to  leverage  the  power  of  existing  heuristics 
in  a  principled  way.  This  strategy  seems  under-exploited  at  the  moment,  and  we  hope  our 
work  will  encourage  more  people  to  pursue  it. 
Appendix A
Additional  Proofs 


A.1 Online Algorithms for Maximizing Submodular Functions


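Theorem 5. $c(f, G') \le 4 \int_{t=0}^{\infty} 1 - \max_{S \in \mathcal{S}} \{ f(S_{\langle t \rangle}) \}\, dt \le 4 \min_{S \in \mathcal{S}} \{ c(f, S) \}$.

Proof. Recall that $G' = \langle g'_1, g'_2, \ldots \rangle$, where $G'_j = \langle g'_1, g'_2, \ldots, g'_{j-1} \rangle$ and

$$g'_j = \arg\max_{(v,\tau) \in V \times \mathbb{R}_{>0}} \left\{ \frac{f(G'_j \oplus \langle (v,\tau) \rangle) - f(G'_j)}{\int_{t'=0}^{\tau} 1 - f(G'_j \oplus \langle (v,t') \rangle)\, dt'} \right\} . \qquad (A.1)$$

Let $s'_j$ equal the value of the ratio attained at the jth arg max in (A.1), multiplied by the
quantity $1 - f(G'_j)$. We will make use of the following claim.

Claim 1. For any schedule S, any positive integer j, and any $t > 0$, $f(S_{\langle t \rangle}) \le f(G'_j) + t \cdot s'_j$.

Proof. Fix an action $a = (v, \tau)$. By monotonicity of f, we have
$\int_{t'=0}^{\tau} 1 - f(G'_j \oplus \langle (v,t') \rangle)\, dt' \le \tau (1 - f(G'_j))$, or equivalently,

$$\frac{1}{\tau} \le \frac{1 - f(G'_j)}{\int_{t'=0}^{\tau} 1 - f(G'_j \oplus \langle (v,t') \rangle)\, dt'} .$$

This and the definition of $s'_j$ imply

$$\frac{f(G'_j \oplus \langle a \rangle) - f(G'_j)}{\tau} \le \left(1 - f(G'_j)\right) \frac{f(G'_j \oplus \langle a \rangle) - f(G'_j)}{\int_{t'=0}^{\tau} 1 - f(G'_j \oplus \langle (v,t') \rangle)\, dt'} \le s'_j .$$

The claim then follows by exactly the same argument that was used to prove Fact 1. □

The remainder of the proof parallels the proof of Theorem 4. Using Claim 1 and the
argument in the proof of Theorem 4, we get that

$$\int_{t=0}^{\infty} 1 - \max_{S \in \mathcal{S}} \{ f(S_{\langle t \rangle}) \}\, dt \ge \sum_{j \ge 1} x_j (y_j - y_{j+1})$$

where $x_j = \frac{R_j}{2 s'_j}$, $y_j = \frac{R_j}{2}$, and $R_j = 1 - f(G'_j)$. Letting $g'_j = (v_j, \tau_j)$, we have

$$\sum_{j \ge 1} x_j (y_j - y_{j+1}) = \frac{1}{4} \sum_{j \ge 1} \int_{t'=0}^{\tau_j} 1 - f(G'_j \oplus \langle (v_j, t') \rangle)\, dt' = \frac{1}{4}\, c(f, G')$$

which proves the theorem.

□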

We now prove the theorems concerning the performance of the greedy schedule G, in
which the jth evaluation of the arg max in (2.4) is performed with additive error $\epsilon_j$. To
ease notation, let $G = \langle g_1, g_2, \ldots \rangle$, where $g_j = (v_j, \tau_j)$. Let $s_j = \frac{f(G_{j+1}) - f(G_j)}{\tau_j}$. To prove
Theorems 6 and 7, we will make use of the following fact, which can be proved in exactly
the same way as Fact 1.

Fact 4. For any schedule S, any positive integer j, and any $t > 0$, we have
$f(S_{\langle t \rangle}) \le f(G_j) + t \cdot (s_j + \epsilon_j)$.

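Theorem 6. Let L be a positive integer, and let $T = \sum_{j=1}^{L} \tau_j$, where $g_j = (v_j, \tau_j)$. Then

$$f(G_{\langle T \rangle}) > \left(1 - \frac{1}{e}\right) \max_{S \in \mathcal{S}} \{ f(S_{\langle T \rangle}) \} - \sum_{j=1}^{L} \epsilon_j \tau_j .$$

Proof. Let $C^* = \max_{S \in \mathcal{S}} \{ f(S_{\langle T \rangle}) \}$, and for any positive integer j, let $\Delta_j = C^* - f(G_j)$.
By Fact 4, $C^* \le f(G_j) + T (s_j + \epsilon_j)$. Thus

$$\Delta_j \le T (s_j + \epsilon_j) = T \left( \frac{\Delta_j - \Delta_{j+1}}{\tau_j} + \epsilon_j \right) .$$

Rearranging this inequality gives $\Delta_{j+1} \le \Delta_j \left(1 - \frac{\tau_j}{T}\right) + \tau_j \epsilon_j$. Unrolling this inequality
(and using the fact that $1 - \frac{\tau_j}{T} < 1$ for all j), we get

$$\Delta_{L+1} \le \Delta_1 \left( \prod_{j=1}^{L} \left(1 - \frac{\tau_j}{T}\right) \right) + \sum_{j=1}^{L} \tau_j \epsilon_j .$$

Let $E = \sum_{j=1}^{L} \tau_j \epsilon_j$. Subject to the constraint $\sum_{j=1}^{L} \tau_j = T$, the product series is
maximized when $\tau_j = \frac{T}{L}$ for all j. Thus we have

$$C^* - f(G_{L+1}) = \Delta_{L+1} \le \Delta_1 \left(1 - \frac{1}{L}\right)^{L} + E < \Delta_1 \frac{1}{e} + E \le C^* \frac{1}{e} + E .$$

Thus $f(G_{L+1}) > \left(1 - \frac{1}{e}\right) C^* - E$, as claimed. □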
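The unrolling step in this proof can be checked numerically. The sketch below iterates the recurrence $\Delta_{j+1} \le \Delta_j (1 - \tau_j / T) + \tau_j \epsilon_j$ with equality (the worst case) and compares the result with the bound $\Delta_1 / e + E$; the durations and error values are arbitrary illustrative numbers.

```python
import math

def worst_case_gap(delta1, taus, eps):
    """Iterate Delta_{j+1} = Delta_j * (1 - tau_j / T) + tau_j * eps_j,
    i.e. the recurrence from the proof taken with equality."""
    total_time = sum(taus)
    delta = delta1
    for tau, e in zip(taus, eps):
        delta = delta * (1.0 - tau / total_time) + tau * e
    return delta

# Hypothetical action durations and additive errors.
taus = [1.0, 2.0, 0.5, 1.5]
eps = [0.01, 0.0, 0.02, 0.01]

E = sum(tau * e for tau, e in zip(taus, eps))
gap = worst_case_gap(1.0, taus, eps)
bound = 1.0 / math.e + E
```

The product $\prod_j (1 - \tau_j / T)$ is largest for equal durations, where it equals $(1 - 1/L)^L < 1/e$, so `gap` never exceeds `bound` for any choice of durations.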


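Theorem 7. Let L be a positive integer, and let $T = \sum_{j=1}^{L} \tau_j$, where $g_j = (v_j, \tau_j)$. For
any schedule S, define $c^T(f, S) = \int_{t=0}^{T} 1 - f(S_{\langle t \rangle})\, dt$. Then

$$c^T(f, G) \le 4 \int_{t=0}^{\infty} 1 - \max_{S \in \mathcal{S}} \{ f(S_{\langle t \rangle}) \}\, dt + \sum_{j=1}^{L} E_j \tau_j ,$$

where $E_j = \sum_{l < j} \epsilon_l \tau_l$.

Proof. Let $R_j = 1 - f(G_j)$, and let $R'_j = R_j - E_j$. Assume for the moment that $R_L \ge E_L$,
so that $R'_j$ is non-negative for $j \le L$. Let $s'_j = s_j + \epsilon_j$. By construction,

$$R'_j - R'_{j+1} = f(G_{j+1}) - f(G_j) + \epsilon_j \tau_j = \tau_j s'_j . \qquad (A.2)$$

Let $x_j = \frac{R'_j}{2 s'_j}$; let $y_j = \frac{R'_j}{2}$; and let $h(x) = 1 - \max_{S} \{ f(S_{\langle x \rangle}) \}$. By Fact 4,

$$\max_{S} \{ f(S_{\langle x_j \rangle}) \} \le f(G_j) + x_j s'_j = f(G_j) + \frac{R'_j}{2} .$$

Thus $h(x_j) \ge R_j - \frac{R'_j}{2} = \frac{R_j + E_j}{2} \ge y_j$. The monotonicity of f implies that $h(x)$ is
non-increasing and (together with the fact that $E_j$ is non-decreasing as a function of j) implies
that the sequence $\langle y_1, y_2, \ldots \rangle$ is non-increasing. As illustrated in Figure 2.1, these facts
imply that $\int_{x=0}^{\infty} h(x)\, dx \ge \sum_{j=1}^{L} x_j (y_j - y_{j+1})$. Thus we have

$$\begin{aligned}
\int_{t=0}^{\infty} 1 - \max_{S} \{ f(S_{\langle t \rangle}) \}\, dt &= \int_{x=0}^{\infty} h(x)\, dx \\
&\ge \sum_{j=1}^{L} x_j (y_j - y_{j+1}) && \text{(Figure 2.1)} \\
&= \frac{1}{4} \sum_{j=1}^{L} R'_j\, \frac{R'_j - R'_{j+1}}{s'_j} \\
&= \frac{1}{4} \sum_{j=1}^{L} R'_j \tau_j && \text{(equation (A.2))} \\
&= \frac{1}{4} \left( \sum_{j=1}^{L} R_j \tau_j - \sum_{j=1}^{L} E_j \tau_j \right) \\
&\ge \frac{1}{4} \left( c^T(f, G) - \sum_{j=1}^{L} E_j \tau_j \right) && \text{(monotonicity of } f\text{)}
\end{aligned}$$

which proves the theorem, subject to the assumption that $R_L \ge E_L$.

Now suppose $R_L < E_L$. Let K be the largest integer such that $R_K \ge E_K$, and let
$T_K = \sum_{j=1}^{K} \tau_j$. By the argument just given,

$$c^{T_K}(f, G) \le 4 \int_{t=0}^{\infty} 1 - \max_{S \in \mathcal{S}} \{ f(S_{\langle t \rangle}) \}\, dt + \sum_{j=1}^{K} E_j \tau_j .$$

Thus to prove the theorem, it suffices to show that $c^T(f, G) \le c^{T_K}(f, G) + \sum_{j=K+1}^{L} E_j \tau_j$.
This holds because

$$\begin{aligned}
c^T(f, G) - c^{T_K}(f, G) &= \int_{t=T_K}^{T} 1 - f(G_{\langle t \rangle})\, dt \\
&\le (T - T_K)\left(1 - f(G_{\langle T_K \rangle})\right) \\
&= (T - T_K) R_{K+1} \\
&\le (T - T_K) E_{K+1} \\
&\le \sum_{j=K+1}^{L} E_j \tau_j .
\end{aligned}$$

□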


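Lemma 2. Any sequence $\langle f_1, f_2, \ldots, f_n \rangle$ of jobs satisfies Condition 2. That is, for any
sequence $S_1, S_2, \ldots, S_n$ of schedules and any schedule S,

$$\frac{\sum_{i=1}^{n} f_i(S_i \oplus S) - f_i(S_i)}{\ell(S)} \le \max_{(v,\tau) \in V \times \mathbb{R}_{>0}} \left\{ \frac{\sum_{i=1}^{n} f_i(S_i \oplus \langle (v,\tau) \rangle) - f_i(S_i)}{\tau} \right\} .$$

Proof. Let r denote the right hand side of the inequality. Let $S = \langle a_1, a_2, \ldots, a_L \rangle$, where
$a_l = (v_l, \tau_l)$. Let

$$\Delta_{i,l} = f_i(S_i \oplus \langle a_1, a_2, \ldots, a_l \rangle) - f_i(S_i \oplus \langle a_1, a_2, \ldots, a_{l-1} \rangle) .$$

We have

$$\begin{aligned}
\sum_{i=1}^{n} f_i(S_i \oplus S) &= \sum_{i=1}^{n} \left[ f_i(S_i) + \sum_{l=1}^{L} \Delta_{i,l} \right] && \text{(telescoping series)} \\
&\le \sum_{i=1}^{n} \left[ f_i(S_i) + \sum_{l=1}^{L} \left( f_i(S_i \oplus \langle a_l \rangle) - f_i(S_i) \right) \right] && \text{(submodularity)} \\
&= \sum_{i=1}^{n} f_i(S_i) + \sum_{l=1}^{L} \sum_{i=1}^{n} \left( f_i(S_i \oplus \langle a_l \rangle) - f_i(S_i) \right) \\
&\le \sum_{i=1}^{n} f_i(S_i) + \sum_{l=1}^{L} r \cdot \tau_l && \text{(definition of } r\text{)} \\
&= \sum_{i=1}^{n} f_i(S_i) + r \cdot \ell(S) .
\end{aligned}$$

Rearranging this inequality gives $\frac{\sum_{i=1}^{n} f_i(S_i \oplus S) - f_i(S_i)}{\ell(S)} \le r$, as claimed. □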

Lemma 4. Algorithm $\mathrm{OG}^{\mathrm{unit}}$ with randomized weighted majority as the subroutine experts
algorithm has $E[R] = O\left(\sqrt{T n \ln |\mathcal{A}|}\right)$ in the worst case.

Proof. Let $k = |\mathcal{A}|$. Let $x_t$ be the total payoff received by $\mathcal{E}_t$, and let $g_t = x_t + r_t$ be the
total payoff that could have been received by $\mathcal{E}_t$ in hindsight (had it been forced to choose
a fixed expert each day). Because $\sum_{t=1}^{T} x_t \le n$, we have $\sum_{t=1}^{T} g_t \le n + R$. Using WMR,
$E[r_t] = O\left(\sqrt{g_t \ln k}\right)$. Using WMR, the actual value of $r_t$ will be tightly concentrated
about its expectation, as can be shown using Azuma's inequality. In particular, because
$g_t \le n$, the probability that $R > n$ is exponentially small. Assuming $R \le n$, we have
$\sum_{t=1}^{T} g_t \le 2n$. Subject to this constraint, $\sum_{t=1}^{T} \sqrt{g_t}$ is maximized when $g_t = \frac{2n}{T}$ for all t.

Thus in the worst case, $E[R] = O\left(\sqrt{T n \ln k}\right)$. □

In order to prove Theorem 10, we first prove the following lemma. The lemma relates
the expected cost of the schedule $S_i$ (selected by OG on round i) to the expected cost
$S_i$ would incur if, hypothetically, each of the “meta-actions” selected by each experts
algorithm $\mathcal{E}_t$ consumed unit time on every job (recall that this assumption was made in
the analysis in the main text).

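Lemma 20. Fix a sequence of jobs $\langle f_1, f_2, \ldots, f_n \rangle$ and an integer i ($1 \le i \le n$). Let $S_i$ be
the schedule produced by OG to use on job $f_i$, and let $S_{i,t-1}$ denote the partial schedule
that exists after the first $t - 1$ experts algorithms have selected actions. Then

$$E\left[ c^{\ell(S_i)}(f_i, S_i) \right] \le E\left[ \sum_{t=1}^{L} \left(1 - f_i(S_{i,t-1})\right) \right] .$$

Proof. Fix some t. Let $a^i_t = (v, \tau)$ be the action selected by $\mathcal{E}_t$ on round i, and define

$$c_t = \begin{cases} \int_{t'=0}^{\tau} 1 - f_i(S_{i,t-1} \oplus \langle (v, t') \rangle)\, dt' & \text{if } a^i_t \text{ is appended to } S_i \\ 0 & \text{otherwise.} \end{cases}$$

By construction, $c^{\ell(S_i)}(f_i, S_i) = \sum_{t=1}^{L} c_t$. Because $a^i_t$ is appended to $S_i$ with probability
$\frac{1}{\tau}$, and because $f_i$ is monotone, we have

$$E\left[ c_t \mid S_{i,t-1} \right] = \frac{1}{\tau} \int_{t'=0}^{\tau} 1 - f_i(S_{i,t-1} \oplus \langle (v, t') \rangle)\, dt' \le 1 - f_i(S_{i,t-1}) .$$

Taking the expectation of both sides yields $E[c_t] \le E[1 - f_i(S_{i,t-1})]$. Then by linearity
of expectation,

$$E\left[ c^{\ell(S_i)}(f_i, S_i) \right] = E\left[ \sum_{t=1}^{L} c_t \right] \le E\left[ \sum_{t=1}^{L} \left(1 - f_i(S_{i,t-1})\right) \right] .$$

□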
Theorem 10. Algorithm OG, run with input $L = T \ln n$, has $E[R_{cost}] = O\left(T \ln n \cdot E[R] + T\sqrt{n}\right)$.
In particular, $E[R_{cost}] = O\left((\ln n)^{3/2}\, T \sqrt{T n \ln |\mathcal{A}|}\right)$ if WMR is used as the
subroutine experts algorithm.

Proof. The arguments in the main text showed that OG can be viewed as a version of the
greedy schedule for the function $f = \frac{1}{n} \sum_{i=1}^{n} f_i$, in which the t-th decision is made with
additive error $\epsilon_t$, under the assumption that all “meta-actions” $a^i_t$ require unit time on every
job. Thus by Theorem 7, we have

$$\sum_{i=1}^{n} \sum_{t=1}^{L} \left(1 - f_i(S_{i,t-1})\right) \le 4 \cdot \min_{S \in \mathcal{S}} \left\{ \sum_{i=1}^{n} c(f_i, S) \right\} + n L \sum_{t=1}^{L} \epsilon_t . \qquad (A.3)$$

Also recall from the main text that $E[\epsilon_t] = E\left[\frac{r_t}{n}\right]$, where $r_t$ is the regret experienced by
$\mathcal{E}_t$, and that we define $R = \sum_{t=1}^{L} r_t$. Thus, we have

$$\begin{aligned}
E\left[ \sum_{i=1}^{n} c^{\ell(S_i)}(f_i, S_i) \right] &\le E\left[ \sum_{i=1}^{n} \sum_{t=1}^{L} \left(1 - f_i(S_{i,t-1})\right) \right] && \text{(Lemma 20)} \\
&\le 4 \cdot \min_{S \in \mathcal{S}} \left\{ \sum_{i=1}^{n} c(f_i, S) \right\} + L \cdot E[R] . && \text{(equation A.3)}
\end{aligned}$$

If it was always the case that $\ell(S_i) \ge T$, then we would have $c^T(f_i, S_i) \le c^{\ell(S_i)}(f_i, S_i)$,
and this inequality would imply $E[R_{cost}] \le L \cdot E[R]$. In order to bound $E[R_{cost}]$, we now
address the possibility that $\ell(S_i) < T$. Letting $p_i = P[\ell(S_i) < T]$, we have

$$\begin{aligned}
E\left[ c^T(f_i, S_i) \right] &= (1 - p_i) \cdot E\left[ c^T(f_i, S_i) \mid \ell(S_i) \ge T \right] + p_i \cdot E\left[ c^T(f_i, S_i) \mid \ell(S_i) < T \right] \\
&\le E\left[ c^{\ell(S_i)}(f_i, S_i) \right] + p_i \cdot T .
\end{aligned}$$

Putting these inequalities together yields

$$E[R_{cost}] \le L \cdot E[R] + T \sum_{i=1}^{n} p_i . \qquad (A.4)$$

We now bound $p_i$. As already mentioned, $E[\ell(S_i)] = L$ regardless of which actions
are selected by the various experts algorithms. If $L \gg T$, then $\ell(S_i)$ will be sharply
concentrated about its mean, as we can prove using standard concentration inequalities
(e.g., Theorem 5 of [18]). In particular, for any $\lambda > 0$, we have

$$P\left[ \ell(S_i) \le L - \lambda \right] \le \exp\left( -\frac{\lambda^2}{2 T L} \right) .$$

Setting $\lambda = L - T$ and simplifying yields $p_i \le \exp\left(-\frac{L}{2T} + 1\right)$. Setting $L = T \ln n$
then yields $p_i \le \frac{e}{\sqrt{n}}$, so the right hand side of (A.4) is $O(T\sqrt{n})$. Thus $E[R_{cost}] =
O\left(T \ln n \cdot E[R] + T\sqrt{n}\right)$, as claimed. Substituting the bound on $E[R]$ stated in Lemma 4
then proves the claim about WMR. □

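Theorem 11. Algorithm $\mathrm{OG}^{p}$, run with WMR as the subroutine experts algorithm, has
$E[R_{coverage}] = O\left((C \ln |\mathcal{A}|)^{1/3} (T n)^{2/3}\right)$ (when run with input $L = T$) and has
$E[R_{cost}] = O\left((T \ln n)^{5/3} (C \ln |\mathcal{A}|)^{1/3} n^{2/3}\right)$ (when run with input $L = T \ln n$) in the priced
feedback model.

Proof. Let M be the number of exploration rounds (so $E[M] = \gamma n$). The maximum
payoff to any single expert cannot exceed M. Thus, by Lemma 5 and the regret bound of
WMR, we have $E[r_t \mid M] = O\left(\frac{1}{\gamma}\sqrt{M \ln |\mathcal{A}|}\right)$. Using the fact that $E[\sqrt{X}] \le \sqrt{E[X]}$
for any non-negative random variable X, this implies

$$E[r_t] = E\left[ E[r_t \mid M] \right] = O\left( \frac{1}{\gamma} \sqrt{E[M] \ln |\mathcal{A}|} \right) = O\left( \sqrt{\frac{n}{\gamma} \ln |\mathcal{A}|} \right) .$$

By Theorem 9, we have $E[R_{coverage}] \le E[R] + C \gamma n = O\left(L \sqrt{\frac{n}{\gamma} \ln |\mathcal{A}|}\right) + C \gamma n$. Setting
$\gamma = \left(\frac{L \sqrt{\ln |\mathcal{A}|}}{C \sqrt{n}}\right)^{2/3}$ then yields $E[R_{coverage}] = O\left((C \ln |\mathcal{A}|)^{1/3} (L n)^{2/3}\right)$, as claimed.

Similarly, by Theorem 10, we have $E[R_{cost}] \le L \cdot E[R] + T\sqrt{n} + T C \gamma n =
L \cdot O\left(L \sqrt{\frac{n}{\gamma} \ln |\mathcal{A}|} + C \gamma n\right)$, so the same setting of $\gamma$ yields
$E[R_{cost}] = O\left(L^{5/3} (C \ln |\mathcal{A}|)^{1/3} n^{2/3}\right)$.

□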


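Theorem 13. Algorithm $\mathrm{OG}^{o}$, run with WMR as the subroutine experts algorithm, has
$E[R_{coverage}] = O\left(T (|\mathcal{A}| \ln |\mathcal{A}|)^{1/3} n^{2/3}\right)$ (when run with input $L = T$) and has
$E[R_{cost}] = O\left((T \ln n)^{2} (|\mathcal{A}| \ln |\mathcal{A}|)^{1/3} n^{2/3}\right)$ (when run with input $L = T \ln n$) in the opaque
feedback model.

Proof. We showed in the main text that $E[\hat{x}^i_{t,a}] = \gamma' x^i_{t,a} + \delta^i_t$, where $\hat{x}^i_{t,a}$ is the
estimated payoff fed back by $\mathrm{OG}^{o}$ and $x^i_{t,a}$ is the true payoff. Thus by Lemma 5, $E[r_t]$
is bounded by $\frac{1}{\gamma'}$ times the worst-case regret of $\mathcal{E}_t$. Using the same argument we used
in the proof of Theorem 11, we get $E[R] = O\left(L \sqrt{\frac{n}{\gamma'} \ln |\mathcal{A}|}\right)$, where $\gamma' = \frac{\gamma}{L |\mathcal{A}|}$. By
Theorem 9, we have $E[R_{coverage}] \le E[R] + \gamma n = O\left(L \sqrt{\frac{n}{\gamma'} \ln |\mathcal{A}|}\right) + C \gamma' n$, where
$C = L |\mathcal{A}|$. As in the proof of Theorem 9, setting $\gamma' = \left(\frac{L \sqrt{\ln |\mathcal{A}|}}{C \sqrt{n}}\right)^{2/3}$ then yields
$E[R_{coverage}] = O\left((C \ln |\mathcal{A}|)^{1/3} (L n)^{2/3}\right) = O\left(T (|\mathcal{A}| \ln |\mathcal{A}|)^{1/3} n^{2/3}\right)$, and the same setting
of $\gamma'$ yields $E[R_{cost}] = O\left(L^{5/3} (C \ln |\mathcal{A}|)^{1/3} n^{2/3}\right) = O\left((T \ln n)^{2} (|\mathcal{A}| \ln |\mathcal{A}|)^{1/3} n^{2/3}\right)$. □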

We  now  prove  lower  bounds  on  regret.  As  mentioned  in  the  main  text,  our  lower 
bounds  will  hold  for  the  online  versions  of  Max  /c-Coverage  and  Min-Sum  Set 
Cover. 

We consider the following online version of Max k-Coverage. One is given a collection $\mathcal{C}$ of sets, where each set in $\mathcal{C}$ is a subset of a universe $E = \{e_1, e_2, \ldots, e_n\}$. One cannot examine the sets (or even determine their cardinalities) directly. On round $i$ of the game, one must specify a subcollection $\mathcal{C}_i \subseteq \mathcal{C}$, with $|\mathcal{C}_i| = k$. One then receives a reward of 1 if element $e_i$ belongs to some set in the subcollection, and receives a reward of zero otherwise. One then learns as feedback which sets $e_i$ belonged to.
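The round structure just described can be sketched in a few lines of Python. This is only an illustration under assumed representations: sets are stored in a dictionary keyed by hypothetical names, and `play_round` is a name introduced here rather than anything defined in the thesis.

```python
def play_round(collection, chosen, element, k):
    """Play one round of online Max k-Coverage (illustrative sketch).

    collection: dict mapping a set's name to its (hidden) contents.
    chosen: names of the k sets the player selects this round.
    element: the element e_i associated with this round.
    Returns (reward, feedback); feedback lists every set containing e_i.
    """
    if len(set(chosen)) != k:
        raise ValueError("exactly k distinct sets must be chosen")
    # Reward is 1 if any chosen set contains the element, else 0.
    reward = 1 if any(element in collection[name] for name in chosen) else 0
    # Feedback reveals which sets the element belonged to.
    feedback = sorted(name for name, members in collection.items()
                      if element in members)
    return reward, feedback


collection = {"A": {1, 2}, "B": {2, 3}, "C": {4}}
print(play_round(collection, ["A", "C"], 3, k=2))  # (0, ['B'])
print(play_round(collection, ["A", "B"], 3, k=2))  # (1, ['B'])
```

Note that the player never reads `collection` directly; it only sees the returned reward and feedback, which mirrors the restriction that the sets cannot be examined.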

This problem is a special case of the online version of Budgeted Maximum Submodular Coverage. To see this, let $\mathcal{V} = \mathcal{C}$ be the set of activities, and think of the action $(v, \tau)$ as including the set $v$ in the collection assuming $\tau \ge 1$, and having no effect otherwise. For any schedule $S$, let $f_i(S) = 1$ if one of the sets added to the collection by $S$ contains $e_i$, and let $f_i(S) = 0$ otherwise. Then Budgeted Maximum Submodular Coverage on the sequence of jobs $(f_1, f_2, \ldots, f_n)$, with time limit $T = k$, is exactly the problem just described.

The online version of Min-Sum Set Cover is similar, except that instead of specifying a subcollection of cardinality $k$, one specifies a sequence of $k$ sets from $\mathcal{C}$. One then incurs a loss equal to the number of sets one must look through in the sequence in order to find $e_i$, or a loss of $k$ if $e_i$ does not appear in the sequence at all. By the arguments just given, this is equivalent to online Min-Sum Submodular Cover on the sequence of jobs $(f_1, f_2, \ldots, f_n)$, where $T = k$ is the time at which schedule costs are truncated.
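The per-round loss of online Min-Sum Set Cover described above can be written as a short function. This is a minimal sketch; the function name `min_sum_loss` and the list-of-sets representation are assumptions made for illustration.

```python
def min_sum_loss(sequence, element):
    """Loss for one round of online Min-Sum Set Cover (illustrative sketch).

    sequence: list of k sets, in the order in which they are examined.
    Returns the 1-based position of the first set containing the element,
    or k (the sequence length) if no set in the sequence contains it.
    """
    for position, members in enumerate(sequence, start=1):
        if element in members:
            return position
    return len(sequence)


sequence = [{1}, {2, 3}, {4}]
print(min_sum_loss(sequence, 3))  # 2
print(min_sum_loss(sequence, 9))  # 3
```

An element found only in the last set and an element that is never found both incur loss $k$, which is the truncation at $T = k$ mentioned in the text.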

To prove lower bounds on regret, we will require the following technical lemma. The proof is a straightforward generalization of the proof of Lemma 3.2.1 of [16], which considered the special case $p = \frac{1}{2}$.

Lemma 21 ([16]). Let $X_1, X_2, \ldots, X_s$ be $s$ independent random variables, where $X_i$ equals the number of heads in $n$ flips of a coin with bias $p$. Let $\mu = np$ and let $\sigma = \sqrt{np(1-p)}$. Then

\[ E[\max\{X_1, X_2, \ldots, X_s\}] = \mu + \Omega\left(\sigma \sqrt{\ln s}\right) . \]
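To get a feel for Lemma 21, the expected maximum of $s$ independent binomial variables can be computed exactly for small $n$. The sketch below is an illustration only (the function names are introduced here); it uses the identity $E[M] = \sum_{k=0}^{n-1} P[M > k]$ for an integer-valued maximum $M \in \{0, \ldots, n\}$, and prints the gap $E[M] - \mu$ in units of $\sigma\sqrt{\ln s}$.

```python
from math import comb, log, sqrt


def binomial_cdf(n, p):
    """Return [P(X <= k) for k = 0..n] for X ~ Binomial(n, p)."""
    cdf, total = [], 0.0
    for k in range(n + 1):
        total += comb(n, k) * p ** k * (1 - p) ** (n - k)
        cdf.append(min(total, 1.0))
    return cdf


def expected_max(n, p, s):
    """Exact E[max(X_1, ..., X_s)] for i.i.d. Binomial(n, p) variables."""
    cdf = binomial_cdf(n, p)
    # P(max > k) = 1 - P(X <= k)^s by independence.
    return sum(1.0 - cdf[k] ** s for k in range(n))


n, p = 100, 0.5
mu, sigma = n * p, sqrt(n * p * (1 - p))
for s in (2, 8, 32, 128):
    gap = expected_max(n, p, s) - mu
    print(s, round(gap / (sigma * sqrt(log(s))), 3))
```

If the lemma's scaling is right, the printed ratios should stay bounded away from zero as $s$ grows; the script only illustrates this numerically and proves nothing.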

Theorem 14. Any algorithm for online Max k-Coverage has worst-case expected 1-regret $\Omega\left(\sqrt{Tn \ln \frac{|\mathcal{V}|}{T}}\right)$, where $\mathcal{V}$ is the collection of sets and $T = k$ is the number of sets selected by the online algorithm on each round.

Proof.  Let  V  be  a  collection  of  sets.  On  each  round  of  the  online  game,  whether  or  not  a 
given  set  covers  the  element  will  be  determined  by  flipping  a  coin  of  bias  P  —  4f  -  Thus, 
regardless  of  which  T  sets  are  selected  by  the  online  algorithm,  the  probability  that  it 
covers  the  element  is  q  =  1  —  (l  —  ^ )T  e  [|,  ^],  and  the  expected  number  of  elements 
the  online  algorithm  covers  is  nq. 

We now consider the number of elements that could have been covered in hindsight. Let $R = \sqrt{\frac{n}{T} \ln \frac{|\mathcal{V}|}{T}}$. Partition $\mathcal{V}$ into $T$ bins, each of size $s = \frac{|\mathcal{V}|}{T}$. Let $S_i^*$ denote the set in the $i$th bin which covers the largest number of elements, and let $\mathcal{C}^* = \{S_1^*, S_2^*, \ldots, S_T^*\}$. To prove the theorem, it suffices to show that $\mathcal{C}^*$ covers $nq + \Omega(TR)$ elements in expectation.

Let a collection $\mathcal{C} = \{S_1, S_2, \ldots, S_T\}$ consist of a random set drawn from each bin. In expectation $\mathcal{C}$ covers $nq$ elements. Let $x_i := |S_i^*| - |S_i|$ and note that $x_i \ge 0$ and $E[x_i] = \Omega(R)$ by Lemma 21. Randomly mark $x_i$ elements of $S_i^*$ and let $M_i$ and $U_i$ denote the marked and unmarked elements of $S_i^*$, respectively. Note that the collection $\{U_i : 1 \le i \le T\}$ covers $nq$ elements in expectation. Let $X$ denote the (random) number of additional elements covered by the collection $\{M_i : 1 \le i \le T\}$ (i.e., $X = |\bigcup_i M_i - \bigcup_i U_i|$). We claim that $E[X] = \Omega(TR)$. To prove this, define $\mathcal{E}$ to be the event "for all $S \in \mathcal{C}$, $|S| \le n/T$" and let $Y$ be the number of marked elements covered exactly once in $\mathcal{C}^*$. We will show that $E[Y \mid \mathcal{E}] \cdot P[\mathcal{E}] = \Omega(TR)$. Since $E[Y \mid \mathcal{E}] \cdot P[\mathcal{E}] \le E[Y] \le E[X]$, this is sufficient to complete the proof.

Fix $i$ and any element $e \in M_i$. Then $P[e \text{ uniquely covered} \mid \mathcal{E}] = \prod_{j \ne i} (1 - |S_j^*|/n) \ge (1 - 1/T)^{T-1} \ge 1/e$. This implies $E[Y \mid \mathcal{E}] \ge \frac{1}{e} E\left[\sum_i |M_i|\right] = \Omega(TR)$, since, as mentioned, $E[|M_i|] = \Omega(R)$ for all $i$. Finally, the Chernoff bound easily yields $P[\mathcal{E}] \ge (1 - |\mathcal{V}| \cdot \exp\{-n/8T\}) = 1 - o(1)$, and so $E[Y \mid \mathcal{E}] \cdot P[\mathcal{E}] = \Omega(TR)$ as claimed. □
The lower bound in Theorem 14 is optimal up to constant factors. To see this, observe that running randomized weighted majority with one expert for each of the $\binom{|\mathcal{V}|}{T}$ possible collections of $T$ sets yields worst-case regret $O\left(\sqrt{n \ln \binom{|\mathcal{V}|}{T}}\right) = O\left(\sqrt{nT \ln \frac{|\mathcal{V}|}{T}}\right)$ for online Max k-Coverage, using the fact that $\binom{|\mathcal{V}|}{T} \le \left(\frac{e|\mathcal{V}|}{T}\right)^T$. Similarly, using a separate expert for each of the $O\left(|\mathcal{V}|^T\right)$ possible permutations of $T$ sets yields regret $O\left(T\sqrt{Tn \ln |\mathcal{V}|}\right)$ for online Min-Sum Set Cover, which shows that the lower bound in Theorem 15 is optimal up to logarithmic factors.

Theorem 15. Any algorithm for online Min-Sum Set Cover has worst-case expected 1-regret $\Omega\left(T\sqrt{Tn \ln \frac{|\mathcal{V}|}{T}}\right)$, where $\mathcal{V}$ is a collection of sets and $T$ is the number of sets selected by the online algorithm on each round.

Proof. We use the same construction as in the proof of Theorem 14. Define the coverage time of a schedule $S_i = \langle S_1^i, S_2^i, \ldots, S_T^i \rangle$ to be the smallest $t$ such that $S_t^i$ covers the $i$th element, or $T$ if no such $t$ exists. As in the proof of Theorem 14, the probability that the online algorithm covers any particular element is $q$. Given that the online algorithm covers an element, the expected coverage time is $zT$ for some $z \le \frac{1}{2}$. Thus, any online algorithm has expected coverage time $\bar{t} = qzT + (1 - q)T$ for each element.

Now consider the schedule $S^* = \langle S_1^*, S_2^*, \ldots, S_T^* \rangle$, where $S_i^* = U_i \cup M_i$ was defined in the proof of Theorem 14, and let the sets be indexed in random order. The schedule $U = \langle U_1, U_2, \ldots, U_T \rangle$ is statistically equivalent to a random schedule, and thus has expected coverage time $\bar{t}$ per element. Using $S^*$ in place of $U$ causes $X$ additional elements to be covered, where $E[X] = \Omega\left(\sqrt{Tn \ln \frac{|\mathcal{V}|}{T}}\right)$. Because the sets in $S^*$ are ordered randomly, the expected coverage time for each of the $X$ additional elements is at most $\frac{T}{2}$. Thus, the total expected coverage time of $S^*$ is smaller than that of $U$ by at least $\frac{T}{2} E[X] = \Omega\left(T\sqrt{Tn \ln \frac{|\mathcal{V}|}{T}}\right)$. □


A.2  Combining  Multiple  Heuristics  Online 

Lemma 9. For any schedule $S$ and any $\alpha > 1$, there exists an $\alpha$-regular schedule $S_\alpha$ such that, for any instance $x$, $E[T(S_\alpha, x)] \le \alpha^2 \cdot E[T(S, x)]$. In the special case where all heuristics are executed in the suspend-and-resume model, $E[T(S_\alpha, x)] \le \alpha \cdot E[T(S, x)]$.




Proof. For any profile $P = (\tau_1, \tau_2, \ldots, \tau_l)$, let $\lceil P \rceil$ be the $\alpha$-regular profile obtained as follows: we first round each $\tau_i$ up to the nearest power of $\alpha$. Then, we round the number of runs of each length up to the nearest floor of a power of $\alpha$ (i.e., the nearest member of $\{\lfloor \alpha^i \rfloor : i \in \mathbb{Z}\}$). For example, if $\alpha = 2$ and $P = (4, 3, 3, 2)$, then $\lceil P \rceil = (4, 4, 4, 4, 2)$. Note that the size of $\lceil P \rceil$ is at most $\alpha^2$ times the size of $P$ (recall that the size of a profile $P = (\tau_1, \tau_2, \ldots, \tau_l)$ equals $\sum_{i=1}^{l} \tau_i$).
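The two rounding steps can be sketched as follows. This is an illustrative implementation, not the thesis's code: the function names are hypothetical, run lengths are assumed to be at least 1, and powers of $\alpha$ are generated by repeated floating-point multiplication, which is exact for integer $\alpha$ but only approximate otherwise.

```python
from collections import Counter


def round_up_to_power(value, alpha):
    """Smallest alpha**i (i >= 0) that is >= value; assumes value >= 1."""
    power = 1.0
    while power < value:
        power *= alpha
    return power


def round_up_to_floor_power(count, alpha):
    """Smallest floor(alpha**i) (i >= 0) that is >= count; assumes count >= 1."""
    power = 1.0
    while int(power) < count:
        power *= alpha
    return int(power)


def regularize(profile, alpha):
    """Return the alpha-regular rounding of a profile (list of run lengths)."""
    # Step 1: round every run length up to a power of alpha.
    rounded = [round_up_to_power(t, alpha) for t in profile]
    # Step 2: round the number of runs of each length up to floor(alpha**i).
    counts = Counter(rounded)
    result = []
    for length in sorted(counts, reverse=True):
        result.extend([length] * round_up_to_floor_power(counts[length], alpha))
    return result


print(regularize([4, 3, 3, 2], 2))  # [4.0, 4.0, 4.0, 4.0, 2.0]
```

On the example from the text the size grows from 12 to 18, comfortably within the factor $\alpha^2 = 4$ claimed above.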

For any state $Y = (P_1, P_2, \ldots, P_k)$, define

\[ \lceil Y \rceil = (\lceil P_1 \rceil, \lceil P_2 \rceil, \ldots, \lceil P_k \rceil) . \]

Again, note that the size of $\lceil Y \rceil$ is at most $\alpha^2$ times the size of $Y$.

Fix some schedule $S$, and consider the set of states $\{\lceil \mathcal{Y}(S_{\langle t \rangle}) \rceil : t \ge 0\}$. Let $(Y_1, Y_2, \ldots)$ be a list of the elements of this set, arranged in increasing order of size. As a simple example, if $|\mathcal{H}| = 2$, $S = \langle (h_1, 3), (h_2, 1) \rangle$, and $\alpha = 2$, then the set contains five elements: $Y_1 = ((), ())$, $Y_2 = ((1), ())$, $Y_3 = ((2), ())$, $Y_4 = ((4), ())$, and $Y_5 = ((4), (1))$.

There is a unique $\alpha$-regular schedule that passes through the sequence of profiles $(Y_1, Y_2, \ldots)$. Call this schedule $S_\alpha$. We claim that for any time $t$,

\[ P[T(S, x) \le t] \le P[T(S_\alpha, x) \le \alpha^2 t] . \qquad \text{(A.5)} \]

This follows from the fact that $S_\alpha$ passes through the profile $\lceil \mathcal{Y}(S_{\langle t \rangle}) \rceil$, which has size at most $\alpha^2 t$. Thus by time $\alpha^2 t$, $S_\alpha$ has done all the work that $S$ has done at time $t$ and more. The fact that (A.5) holds for all $t$ implies $E[T(S_\alpha, x)] \le \alpha^2 \cdot E[T(S, x)]$.
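The final implication can be spelled out with the tail-integral formula for the expectation of a non-negative random variable. The derivation below is a sketch of one standard way to fill in that step; it assumes only (A.5) and the substitution $u = \alpha^2 t$.

```latex
\begin{align*}
E[T(S_\alpha, x)]
  &= \int_0^\infty P[T(S_\alpha, x) > u] \, du \\
  &= \alpha^2 \int_0^\infty P[T(S_\alpha, x) > \alpha^2 t] \, dt
     && (u = \alpha^2 t) \\
  &\le \alpha^2 \int_0^\infty P[T(S, x) > t] \, dt
     && \text{(complement of (A.5))} \\
  &= \alpha^2 \cdot E[T(S, x)] .
\end{align*}
```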

In the special case when all heuristics are executed in the suspend-and-resume model, the argument is exactly the same, except that now each profile $P$ of interest contains only a single number. This means that the size of $\lceil P \rceil$ is at most $\alpha$ times the size of $P$, and thus for any state $Y$ of interest, the size of $\lceil Y \rceil$ is at most $\alpha$ times the size of $Y$. □


Theorem 21. Fix an approximation ratio $\alpha > 1$, a budget $B$, and an error tolerance $\delta > 0$. Then an $\alpha^2$ approximation to schedule

\[ S^* = \arg\min E[\min\{Bk, T(S, x)\}] \]

may be found by computing a shortest path in a state-space graph $G_{\alpha,B} = (V, E, S_e)$, where $|V| = O\left(B^{O(k \log_\alpha \log_\alpha B)}\right)$ and $|E| = O(\log_\alpha B \cdot |V|)$. The overall running time is $O\left(n (\log_\alpha B) B^{O(k \log_\alpha \log_\alpha B)}\right)$, where $n = |X|$.
Proof. The graph $G_{\alpha,B} = (V, E, S_e)$ may be defined inductively as follows. The vertex set contains the empty profile $Y_\emptyset$. Let $Y = (P_1, P_2, \ldots, P_k)$ be a state in the vertex set. Let $P_j$ be a profile with size $< B$ (assuming one exists). Let $\tau$ be a power of $\alpha$ between 1 and $B$, and let $n_\tau$ be the number of runs of length $\tau$ in $P_j$. Let $P'$ be the profile obtained by increasing the number of runs of length $\tau$ to $n'_\tau$, where $n'_\tau$ is the smallest integer $> n_\tau$ that is the floor of a power of $\alpha$ (i.e., $n'_\tau = \lfloor \alpha^i \rfloor$ for some integer $i$). As an example, if $\alpha = 2$, $P_j = (2, 2, 1, 1)$, and $\tau = 2$, then $P' = (2, 2, 2, 2, 1, 1)$. The vertex set contains the state $Y' = (P_1, P_2, \ldots, P_{j-1}, P', P_{j+1}, \ldots, P_k)$. The edge set contains the edge $e = (Y, Y')$, where $S_e(e)$ contains $n'_\tau - n_\tau$ copies of the action $(h_j, \tau)$. Finally, if no profile of size $< B$ exists, then there is an edge from $Y$ to $v^*$, labeled with the empty schedule.
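The inductive construction can be illustrated by a function that enumerates the outgoing edges of a single state. This is a hedged sketch rather than the thesis's implementation: the tuple-of-tuples state representation and all function names are assumptions, the terminal vertex $v^*$ is omitted, and no cap is placed on the size of the extended profile because the text above does not state one.

```python
from collections import Counter


def next_regular_count(count, alpha):
    """Smallest floor(alpha**i), i >= 0, that is strictly greater than count."""
    power = 1.0
    while int(power) <= count:
        power *= alpha
    return int(power)


def run_lengths(alpha, budget):
    """Powers of alpha between 1 and budget (inclusive)."""
    lengths, power = [], 1.0
    while power <= budget:
        lengths.append(power)
        power *= alpha
    return lengths


def successors(state, alpha, budget):
    """Yield (heuristic_index, run_length, copies, new_state) per outgoing edge.

    state: tuple of profiles, one per heuristic; a profile is a tuple of
    run lengths sorted in non-increasing order.
    """
    for j, profile in enumerate(state):
        if sum(profile) >= budget:
            continue  # only profiles of size < B may be extended
        counts = Counter(profile)
        for tau in run_lengths(alpha, budget):
            new_count = next_regular_count(counts[tau], alpha)
            copies = new_count - counts[tau]  # copies of action (h_j, tau)
            new_profile = tuple(sorted(profile + (tau,) * copies, reverse=True))
            yield j, tau, copies, state[:j] + (new_profile,) + state[j + 1:]


for edge in successors(((2, 2, 1, 1),), alpha=2, budget=8):
    print(edge)
```

For the example in the text ($\alpha = 2$, $P_j = (2, 2, 1, 1)$, $\tau = 2$) one of the printed edges adds two copies of the action $(h_j, 2)$ and leads to the profile $(2, 2, 2, 2, 1, 1)$.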

By construction, any $\alpha$-regular schedule that runs each heuristic for time at most $B$ corresponds to a path from $Y_\emptyset$ to $v^*$ in the graph. To make the weight of this path equal the value of the objective function, we must modify the weights in two ways. First, any edge from $Y$ to $v^*$, where performing the runs in $Y$ does not yield a solution to each instance with probability at least $1 - \delta$, is assigned infinite weight. Thus, only paths corresponding to schedules in $\mathcal{S}_\delta$ have finite weight. Second, the schedules $S_e(e)$ must
be  truncated  appropriately  so  that  every  schedule  in  Sg™b  has  length  Bk.  With  these 
modifications, computing a shortest path from $Y_\emptyset$ to $v^*$ yields the $\alpha$-regular schedule in $\mathcal{S}_\delta$ that minimizes the value of the objective function. By Lemma 9, this yields an $\alpha^2$ approximation to $S^*$ (the fact that running time is truncated at $Bk$ only helps, as far as the proof of that lemma is concerned).

We now bound $|V|$ and $|E|$. Assume for simplicity that $B$ is a power of $\alpha$. First, note that the number of distinct run lengths that can appear in an $\alpha$-regular profile equals the number of powers of $\alpha$ between 1 and $B$, which is $1 + \log_\alpha B$. In an $\alpha$-regular profile, the number of runs of each particular length is either 0 or a power of $\alpha$ between 1 and $B$. Thus the number of $\alpha$-regular profiles of size at most $B$ is at most $(1 + \log_\alpha B)^{2 + \log_\alpha B} = B^{O(\log_\alpha \log_\alpha B)}$. Because each vertex is a $k$-tuple of $\alpha$-regular profiles of size at most $B$, it follows that $|V| \le B^{O(k \log_\alpha \log_\alpha B)}$. Lastly, because each edge represents increasing the number of runs of length $\tau$, where $\tau$ is a power of $\alpha$ between 1 and $B$, each vertex has at most $1 + \log_\alpha B$ outgoing edges, so $|E| \le (1 + \log_\alpha B) |V|$.

To complete the proof, it suffices to show that edge weights can be computed in time $O(n)$ per edge. Consider the special case $n = 1$, and let $x$ be the single problem instance in $X$. Consider an edge $e = (Y, Y')$, and let $S = S_e(e)$. The weight assigned to edge $e$ can be written as

\[ w(e, x) = Q(Y) \cdot \int_{t=0}^{\ell(S)} 1 - p_x(S_{\langle t \rangle}) \, dt \]

where $Q(Y)$ is the probability that performing the runs in state $Y$ does not yield a solution to $x$ (this definition of $w(e, x)$ is consistent with equation (3.4)). Suppose the value of $Q(Y)$ is stored at the vertex $Y$. Also, suppose that the value of $\int_{t=0}^{\ell(S)} 1 - p_x(S_{\langle t \rangle}) \, dt$ has been precomputed in advance, for all choices of $S = S_e(e)$ (the number of choices is at most $k(1 + \log_\alpha B)^2$, so the time required for the precomputation step does not contribute to the overall time complexity). Then the time required to compute $w(e, x)$ is $O(1)$. Given the value of $Q(Y)$, the value of $Q(Y') = Q(Y)(1 - p_x(S))$ can also be computed in $O(1)$ time and stored at vertex $Y'$ for future use, assuming that we precompute $p_x(S)$ for all possible choices of $S$ (again, this does not affect overall time complexity). Finally, in the general case $n > 1$, the weights obtained in this way can simply be summed across all $n$ instances. □


Theorem 18. If $\mathcal{H}$ contains a single (randomized) heuristic and $X = \{x\}$ contains a single instance, then

\[ E[T(G, x)] = \min_S E[T(S, x)] . \]

Proof. Luby et al. [61] proved that, when running a single randomized heuristic on a single problem instance, the optimal schedule is a uniform restart schedule of the form

\[ S_\tau = \langle (h, \tau), (h, \tau), (h, \tau), \ldots \rangle . \]

Let $p(\tau)$ denote the probability that performing the action $(h, \tau)$ yields a solution to $x$. $E[T(S_\tau, x)]$ satisfies the recurrence $E[T(S_\tau, x)] = \int_{t'=0}^{\tau} 1 - p(t') \, dt' + (1 - p(\tau)) E[T(S_\tau, x)]$, or rearranging,

\[ E[T(S_\tau, x)] = \frac{1}{p(\tau)} \int_{t'=0}^{\tau} 1 - p(t') \, dt' . \]

To complete the proof, we show that $G = S_{\tau^*}$, where $\tau^* = \arg\min_{\tau > 0} E[T(S_\tau, x)]$ (we assume for the moment that $\tau^*$ is unique). Inductively, suppose $G_j$ is of this form, and consider the quantity that is maximized when choosing $g_j$. Let $f(S) = p_x(S)$. For any action $(h, \tau)$, $f(G_j \oplus \langle (h, \tau) \rangle) - f(G_j)$ is the probability that action $(h, \tau)$ solves the problem after every action in $G_j$ has failed, which can also be written as $(1 - f(G_j)) \cdot p(\tau)$. Thus $g_j = (h, \tau)$ is selected so as to maximize the quantity

/(G'e((ft,T)»  -/(g;)  (1  -/(Gg)-p(r)  1 
By definition, this quantity is maximized by setting $\tau = \tau^*$. Thus, we have $G = S_{\tau^*}$, so $G$ is optimal as claimed.

Finally, note that even if $\tau^*$ is not unique, $G$ will be a uniform schedule as long as ties are broken in a consistent manner (e.g., in favor of the smallest value of $\tau$), and thus $G$ will be optimal by the arguments just given. □
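The closed form for the expected running time of a uniform restart schedule can be evaluated directly when the run-length distribution is discrete. The sketch below is illustrative only: the function names and the dictionary representation are assumptions, $p(t)$ is taken to be the probability that a run finishes within time $t$, and the integral $\int_0^\tau 1 - p(t') \, dt'$ is computed as $E[\min(\text{run time}, \tau)]$.

```python
def expected_restart_time(run_times, tau):
    """E[T(S_tau, x)] for the uniform restart schedule with cutoff tau.

    run_times: dict mapping each possible running time to its probability.
    """
    # p(tau): probability that a single run of length tau succeeds.
    p_tau = sum(prob for t, prob in run_times.items() if t <= tau)
    if p_tau == 0:
        return float("inf")
    # Integral of (1 - p(t)) over [0, tau] equals E[min(run time, tau)].
    truncated_mean = sum(prob * min(t, tau) for t, prob in run_times.items())
    return truncated_mean / p_tau


def best_cutoff(run_times):
    """Cutoff minimising the expected time; ties go to the smallest tau."""
    return min(sorted(run_times),
               key=lambda tau: expected_restart_time(run_times, tau))


heavy_tailed = {1: 0.5, 100: 0.5}
print(expected_restart_time(heavy_tailed, 1))    # 2.0
print(expected_restart_time(heavy_tailed, 100))  # 50.5
print(best_cutoff(heavy_tailed))                 # 1
```

Only cutoffs at support points are searched, which suffices for a discrete distribution because $p(\tau)$ is constant between support points while the truncated mean is non-decreasing. Breaking ties toward the smallest $\tau$ mirrors the consistent tie-breaking rule mentioned in the proof.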

Lemma 12. For any profile $P = (\tau_1, \tau_2, \ldots, \tau_k)$ of size $\le B$, define $L_i(P) = \{ i' : 1 \le i' \le B, \ \frac{B}{i'} \ge \tau_i \}$. Then the quantity

\[ \bar{q}_h(P) = \prod_{i=1}^{k} \frac{|\{ i' \in L_i(P) : T_{i'} > \tau_i \}| - i + 1}{|L_i(P)| - i + 1} \]

is an unbiased estimate of $q_h(P)$ (i.e., $E[\bar{q}_h(P)] = q_h(P)$).

Proof. Given the trace $T = (T_1, T_2, \ldots, T_B)$, suppose we construct a new trace $T'$ by randomly permuting the elements of $T$ using the following procedure (the procedure is well-defined assuming $|L_i(P)| \ge i$ for each $i$, which follows from the fact that $\frac{B}{i} \ge \tau_i$ if $P$ has size $\le B$):

1. For $i$ from 1 to $k$:

• Choose $l_i$ uniformly at random from $L_i(P) \setminus \{l_1, l_2, \ldots, l_{i-1}\}$.

2. Set $T' \leftarrow (T_{l_1}, T_{l_2}, \ldots, T_{l_k})$.

Because the indices are arbitrary, $P[T' \text{ encloses } P] = P[T \text{ encloses } P] = q_h(P)$.

On the other hand, it is not difficult to show that the product series $\bar{q}_h(P)$ equals the conditional probability $P[T' \text{ encloses } P \mid T]$ (by construction, the $i$th factor in the product series is the probability that $T'_i > \tau_i$, conditioned on the fact that $T'_g > \tau_g$ for all $g < i$).

Thus we have

\begin{align*}
E[\bar{q}_h(P)] &= E[P[T' \text{ encloses } P \mid T]] \\
&= P[T' \text{ encloses } P] \\
&= q_h(P) .
\end{align*}

□
A.3 The Max k-Armed Bandit Problem


In the proofs that follow, we make use of the notation introduced in §5.3.2. In particular, we use $G$ to denote a GEV distribution satisfying the conditions described in §5.3.2, and we use $m_j$ to denote the expected maximum of $j$ independent samples from $G$.


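Theorem 31. Let $I = (G_1, G_2, \ldots, G_k)$ be an instance of the max k-armed bandit problem, where $G_i = GEV(\mu_i, \sigma_i, \xi_i)$, and $\xi_i \le 0$ for all $i$. Then strategy $S_{GEV}$, run on instance $I$ with parameters $\epsilon = \sqrt[3]{\frac{k}{n}}$ and $\delta = \frac{1}{kn^2}$, has regret $O\left(\ln(nk) \ln(n)^2 \sqrt[3]{\frac{k}{n}}\right)$.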


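Proof. Building on the proof sketch given in the main text, it remains only to show that for any arm $i$,

\[ m^i_n - m^i_{n - t(k-1)} = O\left(\frac{tk}{n}\right) . \]

To ease notation, let $k' = k - 1$, and let $\mu = \mu_i$, $\sigma = \sigma_i$, and $\xi = \xi_i$ be the parameters of arm $i$. Suppose $\xi = 0$. Then by Proposition 3, $m^i_n - m^i_{n - tk'} = \sigma(\ln(n) - \ln(n - tk'))$. Thus for $n$ sufficiently large,

\begin{align*}
m^i_n - m^i_{n - tk'} &= \sigma(\ln(n) - \ln(n - tk')) \\
&= -\sigma \ln\left(\frac{n - tk'}{n}\right) \\
&= -\sigma \ln\left(1 - \frac{tk'}{n}\right) \\
&\le 2\sigma \frac{tk'}{n} \\
&= O\left(\frac{tk}{n}\right)
\end{align*}

where on the fourth line we have used the fact that for $n$ sufficiently large, $\frac{tk'}{n} \le \frac{1}{2}$, and for $0 \le x \le \frac{1}{2}$, $\ln(1 - x) \ge -2x$.

Now suppose $\xi < 0$. By Proposition 3, $m^i_n - m^i_{n - tk'} = \frac{\sigma}{\xi} \Gamma(1 - \xi)(n^\xi - (n - tk')^\xi) = O((n - tk')^\xi - n^\xi)$, where we have used the fact that $\frac{\sigma}{\xi} \Gamma(1 - \xi)$ is negative and has bounded absolute value. Expanding $(n - tk')^\xi$ in powers of $t$ about $t = 0$ gives

\[ (n - tk')^\xi = n^\xi - \xi n^{\xi - 1} tk' + O\left((tk')^2 n^{\xi - 2}\right) . \]

Because $\xi < 0$ and $|\xi|$ is bounded, it follows that $(n - tk')^\xi - n^\xi$ is $O\left(\frac{tk}{n}\right)$. □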
In order to prove Lemma 15, we first prove the following lemma.

Lemma 22. For any fixed positive integer $j$, $O\left(\frac{1}{\epsilon^2}\right)$ draws from $G$ suffice to obtain an estimate $\hat{m}_j$ of $m_j$ such that

\[ P[|\hat{m}_j - m_j| < \epsilon] \ge \frac{3}{4} . \]

Proof. First consider the special case $j = 1$. Let $X$ denote the sum of $t$ draws from $G$, for some to-be-specified positive integer $t$. Then $E[X] = m_1 t$ and $Var[X] = \bar{\sigma}^2 t$, where $\bar{\sigma}$ is the (unknown) standard deviation of $G$ ($\bar{\sigma}$ is proportional to, but not the same as, the scale parameter $\sigma$ of the GEV distribution $G$). We take $\hat{m}_1 = \frac{X}{t}$ as our estimate of $m_1$. Then

\begin{align*}
P[|\hat{m}_1 - m_1| \ge \epsilon] &= P[|t\hat{m}_1 - t m_1| \ge t\epsilon] \\
&= P\left[|X - E[X]| \ge \frac{\epsilon \sqrt{t}}{\bar{\sigma}} \sqrt{Var[X]}\right] \\
&\le \frac{\bar{\sigma}^2}{t \epsilon^2}
\end{align*}

where the last inequality is Chebyshev's. Thus to guarantee $P[|\hat{m}_1 - m_1| \ge \epsilon] \le \frac{1}{4}$ we must set $t = \frac{4\bar{\sigma}^2}{\epsilon^2} = O\left(\frac{1}{\epsilon^2}\right)$ (note that due to the assumptions in §5.3.2, $\bar{\sigma}$ is $O(1)$).

In the general case $j > 1$, we let $X$ be the sum of $t$ block maxima (each the maximum of $j$ independent draws from $G$). Because the standard deviation of $M_j$ and $j$ itself are both $O(1)$, the lemma follows by exactly the same argument. □
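The sample size implied by the Chebyshev argument is easy to compute. The helper below is a minimal sketch; `chebyshev_sample_size` is a hypothetical name, and the standard deviation is passed in as if known even though the proof only assumes it is bounded.

```python
import math


def chebyshev_sample_size(std_dev, epsilon, failure_prob=0.25):
    """Smallest integer t with std_dev**2 / (t * epsilon**2) <= failure_prob.

    By Chebyshev's inequality, the mean of t i.i.d. draws then deviates from
    its expectation by at least epsilon with probability <= failure_prob.
    """
    return math.ceil(std_dev ** 2 / (failure_prob * epsilon ** 2))


print(chebyshev_sample_size(std_dev=2.0, epsilon=0.5))  # 64
```

With the default failure probability of 1/4 this reduces to $t = 4\bar{\sigma}^2/\epsilon^2$, rounded up to an integer.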


To boost the probability that $|\hat{m}_j - m_j| < \epsilon$ from $\frac{3}{4}$ to $1 - \delta$, we use the "median of means" method.

Lemma 15. Let $j$ be a positive integer and let $\epsilon > 0$ and $\delta \in (0, 1)$ be real numbers. Then $O\left(\ln\left(\frac{1}{\delta}\right) \frac{1}{\epsilon^2}\right)$ draws from $G$ suffice to obtain an estimate $\hat{m}_j$ of $m_j$ such that

\[ P[|\hat{m}_j - m_j| < \epsilon] \ge 1 - \delta . \]

Proof. We invoke Lemma 22 $r$ times (for $r$ to be determined), yielding a set

\[ E = \left\{ \hat{m}_j^{(1)}, \hat{m}_j^{(2)}, \ldots, \hat{m}_j^{(r)} \right\} \]

of estimates of $m_j$. Let $\hat{m}_j$ be the median element of $E$. Let $\mathcal{A} = \{\hat{m} \in E : |\hat{m} - m_j| < \epsilon\}$ be the set of "accurate" estimates of $m_j$, and let $A = |\mathcal{A}|$. Then $|\hat{m}_j - m_j| \ge \epsilon$ implies $A \le \frac{r}{2}$, while $E[A] \ge \frac{3}{4} r$. Using Chernoff's inequality, we have

\[ P[|\hat{m}_j - m_j| \ge \epsilon] \le P\left[A \le \frac{r}{2}\right] \le \exp(-Cr) \]

for constant $C > 0$. Thus $r = O\left(\ln\left(\frac{1}{\delta}\right)\right)$ repetitions suffice to ensure $P[|\hat{m}_j - m_j| \ge \epsilon] \le \delta$. □
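The estimator used in the proof (a median of $r$ independent means of $t$ block maxima) can be sketched as follows. This is an illustration under assumptions: `draw` is any zero-argument function returning one sample, the function names are hypothetical, and the parameters `t` and `r` are supplied by the caller rather than derived from $\epsilon$ and $\delta$.

```python
import random
import statistics


def block_maximum_mean(draw, j, t):
    """Mean of t block maxima, each the maximum of j independent draws."""
    return sum(max(draw() for _ in range(j)) for _ in range(t)) / t


def median_of_means(draw, j, t, r):
    """Median of r independent estimates from block_maximum_mean."""
    return statistics.median(block_maximum_mean(draw, j, t) for _ in range(r))


# Usage: estimate the expected maximum of j = 4 standard normal draws.
rng = random.Random(0)
estimate = median_of_means(lambda: rng.gauss(0.0, 1.0), j=4, t=200, r=9)
print(estimate)
```

Taking the median means a single badly-off mean (for example, one inflated by an outlier) cannot move the final estimate, which is what allows the failure probability to fall exponentially in $r$.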


In order to prove Lemma 16, we must first prove the following lemma. Again, we use $m_j$ to denote the expected maximum of $j$ samples from a GEV distribution $G$ satisfying the assumptions described in §5.3.2.

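Lemma 23.

\[ m_4 - m_2 \ge \frac{1}{8}\sigma \quad \text{and} \quad m_2 - m_1 \ge \frac{1}{4}\sigma . \]

Proof. If $\xi = 0$, then by Proposition 3, $m_4 - m_2 = m_2 - m_1 = \ln(2)\sigma$ and we are done. Otherwise,

\[ m_2 - m_1 = \sigma(2^\xi - 1)\xi^{-1}\Gamma(1 - \xi) \quad \text{and} \quad m_4 - m_2 = \sigma(4^\xi - 2^\xi)\xi^{-1}\Gamma(1 - \xi) . \]

It thus suffices to prove that

\[ \min_{\xi \le \frac{1}{2}} \left\{ \frac{2^\xi - 1}{\xi} \Gamma(1 - \xi) \right\} \ge \frac{1}{4} \]

and

\[ \min_{\xi \le \frac{1}{2}} \left\{ \frac{4^\xi - 2^\xi}{\xi} \Gamma(1 - \xi) \right\} \ge \frac{1}{8} . \]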

To  do  so,  we  first  state  without  proof  the  following  properties  of  the  T  function: 

T(z)  >  [z\\  Vz>2 

r(4  >|  w  >  o 

Making  the  change  of  variable  y  =  —  £,  it  suffices  to  show 

f  1  —  2~y  )  1 

min  <  - T(1  +  y)>>~,  and 

y>-\  l  V  J  4 


mm 

y>- 5 


2-y(i  -  2 -y) 


r(i  +  t/) 


(A. 6)  holds  because  for  —  \  <  y  <  i, 


1  -  2~y  1  1 

— —  r(l  +  y)>-r(l  +  y)>if 


(A. 6) 

(A.7) 
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VI  □ 


while  for  y  >  1, 


i  -  2~y 


y 


r(i  +  j/)> 


Ly  +  iJ!  >  i 

2  y  ~  2  ' 


Similarly,  (A.7)  holds  because  for 


2~v(l  - 

y 


2~y) 

— r(l  +  y) 


while  for  y  >  1, 


2-y(i  -  2-w) 


r(i  +  j/)  > 


b  +  lj!  >  1 

2y(2v)  ~  8  ' 


□ 


We  are  now  ready  to  prove  Lemma  16. 

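Lemma 16. For real numbers $\epsilon > 0$ and $\delta \in (0, 1)$, $O\left(\ln\left(\frac{1}{\delta}\right) \frac{1}{\epsilon^2}\right)$ draws from $G$ suffice to obtain an estimate $\hat{\xi}$ of $\xi$ such that

\[ P[|\hat{\xi} - \xi| < \epsilon] \ge 1 - \delta . \]

Proof. In the proof sketch in the main text, we showed that

\[ \xi = \log_2\left(\frac{m_4 - m_2}{m_2 - m_1}\right) . \qquad \text{(A.8)} \]

Let $\hat{m}_1$, $\hat{m}_2$, and $\hat{m}_4$ be estimates of $m_1$, $m_2$, and $m_4$, respectively, and let $\hat{\xi}$ be the estimate of $\xi$ obtained by plugging $\hat{m}_1$, $\hat{m}_2$, and $\hat{m}_4$ into this equation. Define $\Delta_m = \max_{j \in \{1, 2, 4\}} |\hat{m}_j - m_j|$ and define $\Delta_\xi = |\hat{\xi} - \xi|$. Building on the proof sketch in the main text, it remains only to show that $\Delta_\xi = O(\Delta_m)$.

In the proof of Theorem 31 we showed that $|\ln(x + \beta) - \ln(x)| \le \frac{2\beta}{x}$ for $\beta \le \frac{x}{2}$. Letting $N = m_4 - m_2$ and $D = m_2 - m_1$, and noting that $\xi = \log_2(N) - \log_2(D) = \frac{1}{\ln(2)}(\ln(N) - \ln(D))$, it follows that

\[ \Delta_\xi \le \frac{1}{\ln(2)} \left( \frac{2(2\Delta_m)}{N} + \frac{2(2\Delta_m)}{D} \right) \]

for $\Delta_m \le \frac{1}{4} \min(N, D)$. Thus by Lemma 23 and the assumption that $\sigma \ge \sigma_\ell$, $\Delta_\xi$ is $O(\Delta_m)$, as claimed. □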
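Equation (A.8) can be checked numerically against the closed-form expected maxima of a GEV distribution. The sketch below is illustrative: the function names are hypothetical, and `gev_expected_max` uses the standard GEV mean formula (valid for $\xi < 1$), which is assumed here to agree with Proposition 3 since that proposition is not reproduced in this appendix.

```python
import math

EULER_GAMMA = 0.5772156649015329


def gev_expected_max(j, mu, sigma, xi):
    """Expected maximum of j i.i.d. draws from GEV(mu, sigma, xi), xi < 1."""
    if xi == 0:
        # Gumbel case: the maximum of j draws shifts the location by sigma*ln(j).
        return mu + sigma * (math.log(j) + EULER_GAMMA)
    return mu + (sigma / xi) * (j ** xi * math.gamma(1 - xi) - 1)


def estimate_shape(m1, m2, m4):
    """Shape estimate from (A.8): log2((m4 - m2) / (m2 - m1))."""
    return math.log2((m4 - m2) / (m2 - m1))


mu, sigma, xi = 1.0, 2.0, -0.3
m1, m2, m4 = (gev_expected_max(j, mu, sigma, xi) for j in (1, 2, 4))
print(estimate_shape(m1, m2, m4))  # approximately -0.3
```

In practice $m_1$, $m_2$, and $m_4$ would be replaced by the estimates $\hat{m}_1$, $\hat{m}_2$, and $\hat{m}_4$ from Lemma 15; the example uses exact values only to confirm that (A.8) recovers $\xi$.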
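Lemma 18. Assume $G$ has shape parameter $\xi \le -\xi^*$. Let $n$ be a positive integer and let $\epsilon > 0$ and $\delta \in (0, 1)$ be real numbers. Then $O\left(\ln\left(\frac{1}{\delta}\right) \frac{1}{\epsilon^2}\right)$ draws from $G$ suffice to obtain an estimate $\hat{m}_n$ of $m_n$ such that

\[ P[|\hat{m}_n - m_n| < \epsilon] \ge 1 - \delta . \]

Proof. In the proof sketch in the main text, we showed that

\[ m_j = \alpha_1 + \alpha_2 \alpha_3^{\log_2(j)} \]

where

\begin{align*}
\alpha_1 &= (m_1 m_4 - m_2^2)(m_1 - 2m_2 + m_4)^{-1} \\
\alpha_2 &= (-2 m_1 m_2 + m_1^2 + m_2^2)(m_1 - 2m_2 + m_4)^{-1} \\
\alpha_3 &= (m_4 - m_2)(m_2 - m_1)^{-1} .
\end{align*}

Let $\hat{m}_1$, $\hat{m}_2$, and $\hat{m}_4$ be estimates of $m_1$, $m_2$, and $m_4$, respectively. Plugging $\hat{m}_1$, $\hat{m}_2$, and $\hat{m}_4$ into the above equations yields estimates, say $\hat{\alpha}_1$, $\hat{\alpha}_2$, and $\hat{\alpha}_3$, of $\alpha_1$, $\alpha_2$, and $\alpha_3$, respectively. Define $\Delta_m = \max_{j \in \{1, 2, 4\}} |\hat{m}_j - m_j|$ and $\Delta_\alpha = \max_{i \in \{1, 2, 3\}} |\hat{\alpha}_i - \alpha_i|$. To complete the proof, it remains to show that

\[ |\hat{m}_n - m_n| = O(\Delta_m) . \]

The argument consists of two parts: in claims 1 through 3 we show that $\Delta_\alpha$ is $O(\Delta_m)$, then in Claim 4 we show that $|\hat{m}_n - m_n|$ is $O(\Delta_\alpha)$.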

Claim 1. Each of the numerators in the expressions for $\alpha_1$, $\alpha_2$, and $\alpha_3$ has absolute value bounded from above, while each of the denominators has absolute value bounded from below. (The bounds are independent of the unknown parameters of $G$.)

Proof of claim 1. The numerators will have bounded absolute value as long as $m_1$, $m_2$, and $m_4$ are bounded. Upper bounds on $m_1$, $m_2$, and $m_4$ follow from the restrictions on the parameters $\mu$, $\sigma$, and $\xi$. As for the denominators, by Lemma 23 we have

\begin{align*}
|m_1 - 2m_2 + m_4| &= |(m_2 - m_1)(\alpha_3 - 1)| \\
&\ge \frac{1}{8}\sigma_\ell |2^{-\xi^*} - 1| .
\end{align*}

□


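Claim 2. Let $N$ and $D$ be fixed real numbers, and let $\beta_N$ and $\beta_D$ be real numbers with $|\beta_D| \le \frac{|D|}{2}$. Then $\left|\frac{N + \beta_N}{D + \beta_D} - \frac{N}{D}\right|$ is $O(|\beta_N| + |\beta_D|)$.

Proof of claim 2. First, using the Taylor series expansion of $\frac{N}{D + \beta_D}$,

\begin{align*}
\left|\frac{N}{D + \beta_D} - \frac{N}{D}\right| &= \left|\frac{N\beta_D}{D^2} \sum_{i=0}^{\infty} \left(-\frac{\beta_D}{D}\right)^i\right| \\
&\le \frac{|N\beta_D|}{D^2(1 - |\beta_D| |D|^{-1})} \\
&= O(|\beta_D|) .
\end{align*}

Then

\begin{align*}
\left|\frac{N + \beta_N}{D + \beta_D} - \frac{N}{D}\right| &\le \left|\frac{N}{D + \beta_D} - \frac{N}{D}\right| + \left|\frac{\beta_N}{D + \beta_D}\right| \\
&= O(|\beta_N| + |\beta_D|) .
\end{align*}

□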

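Claim 3. $\Delta_\alpha$ is $O(\Delta_m)$.

Proof of claim 3. We show that $|\hat{\alpha}_1 - \alpha_1|$ is $O(\Delta_m)$. Similar arguments show that $|\hat{\alpha}_2 - \alpha_2|$ and $|\hat{\alpha}_3 - \alpha_3|$ are $O(\Delta_m)$, which proves the claim. To see that $|\hat{\alpha}_1 - \alpha_1|$ is $O(\Delta_m)$, let $N = m_1 m_4 - m_2^2$, and let $D = m_1 - 2m_2 + m_4$, so that $\alpha_1 = \frac{N}{D}$. Define $\hat{N}$ and $\hat{D}$ in the natural way so that $\hat{\alpha}_1 = \frac{\hat{N}}{\hat{D}}$. Because $m_1$, $m_2$, and $m_4$ are all $O(1)$ (by Claim 1), it follows that both $|\hat{N} - N|$ and $|\hat{D} - D|$ are $O(\Delta_m)$. That $|\hat{\alpha}_1 - \alpha_1|$ is $O(\Delta_m)$ follows by Claim 2. □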

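Claim 4. $|\hat{m}_n - m_n|$ is $O(\Delta_\alpha)$.

Proof of claim 4. Because $\xi_\ell \le \xi \le -\xi^*$ it must be that $0 < 2^{\xi_\ell} \le \alpha_3 \le 2^{-\xi^*} < 1$. So for $\Delta_\alpha$ sufficiently small, $0 < \hat{\alpha}_3 < 1$.

\begin{align*}
|\hat{m}_n - m_n| &= \left|\left(\hat{\alpha}_1 + \hat{\alpha}_2 \hat{\alpha}_3^{\log_2(n)}\right) - \left(\alpha_1 + \alpha_2 \alpha_3^{\log_2(n)}\right)\right| \\
&\le |\hat{\alpha}_1 - \alpha_1| + \left|\hat{\alpha}_2 \hat{\alpha}_3^{\log_2(n)} - \hat{\alpha}_2 \alpha_3^{\log_2(n)}\right| + \left|\hat{\alpha}_2 \alpha_3^{\log_2(n)} - \alpha_2 \alpha_3^{\log_2(n)}\right| \\
&\le |\hat{\alpha}_1 - \alpha_1| + |\hat{\alpha}_2| |\hat{\alpha}_3 - \alpha_3| + |\hat{\alpha}_2 - \alpha_2| \\
&= O(\Delta_\alpha)
\end{align*}

where on the third line we have used the fact that both $\hat{\alpha}_3$ and $\alpha_3$ are between 0 and 1, and in the last line we have used the fact that $|\hat{\alpha}_2|$ is $O(1)$. □

□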
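The extrapolation formula $m_j = \alpha_1 + \alpha_2 \alpha_3^{\log_2(j)}$ used in this proof can be sketched as a small function. The name `extrapolate_expected_max` is hypothetical; the expressions for $\alpha_1$, $\alpha_2$, and $\alpha_3$ are the ones stated in the proof, and the example values are synthetic, chosen so that $\alpha_1 = 5$, $\alpha_2 = -3$, and $\alpha_3 = 0.5$.

```python
import math


def extrapolate_expected_max(m1, m2, m4, n):
    """Estimate m_n from (estimates of) m_1, m_2, and m_4."""
    denom = m1 - 2 * m2 + m4
    alpha1 = (m1 * m4 - m2 ** 2) / denom
    alpha2 = (m1 ** 2 - 2 * m1 * m2 + m2 ** 2) / denom
    alpha3 = (m4 - m2) / (m2 - m1)
    return alpha1 + alpha2 * alpha3 ** math.log2(n)


# Synthetic check: m_j = 5 - 3 * 0.5**log2(j) gives m_1 = 2, m_2 = 3.5, m_4 = 4.25.
print(extrapolate_expected_max(2.0, 3.5, 4.25, 8))  # 4.625
```

The denominator `m1 - 2*m2 + m4` vanishes when $\alpha_3 = 1$ (that is, $\xi = 0$), which is why Claim 1 needs $\alpha_3$ bounded away from 1.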


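Lemma 19. Assume $G$ has shape parameter $\xi \ge \xi^*$. Let $n$ be a positive integer and let $\epsilon > 0$ and $\delta \in (0, 1)$ be real numbers. Then $O\left(\ln\left(\frac{1}{\delta}\right) \frac{\ln(n)^2}{\epsilon^2}\right)$ draws from $G$ suffice to obtain an estimate $\hat{m}_n$ of $m_n$ such that

\[ P\left[\frac{1}{1 + \epsilon} < \frac{\hat{m}_n - \alpha_1}{m_n - \alpha_1} < (1 + \epsilon)\right] \ge 1 - \delta \]

where $\alpha_1 = \mu - \frac{\sigma}{\xi}$.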


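Proof. We use the same estimation procedure as in the proof of Lemma 18. Let $\alpha_1$, $\alpha_2$, $\alpha_3$, $\Delta_\alpha$, and $\Delta_m$ be defined as they were in that proof.

The inequality $\frac{1}{1 + \epsilon} < \frac{\hat{m}_n - \alpha_1}{m_n - \alpha_1} < 1 + \epsilon$ is the same as $\left|\ln\left(\frac{\hat{m}_n - \alpha_1}{m_n - \alpha_1}\right)\right| < \ln(1 + \epsilon)$. For $\epsilon \le \frac{1}{4}$, $\ln(1 + \epsilon) \ge \frac{7}{8}\epsilon$, so it suffices to guarantee that

\[ |\ln(\hat{m}_n - \alpha_1) - \ln(m_n - \alpha_1)| < \frac{7}{8}\epsilon . \]

Claim 1. $|\ln(\hat{m}_n - \alpha_1) - \ln(m_n - \alpha_1)|$ is $O(\ln(n)\Delta_\alpha)$.

Proof of claim 1. Because $|\ln(\hat{m}_n - \alpha_1) - \ln(\hat{m}_n - \hat{\alpha}_1)|$ is $O(\Delta_\alpha)$, it suffices to show that $|\ln(\hat{m}_n - \hat{\alpha}_1) - \ln(m_n - \alpha_1)|$ is $O(\ln(n)\Delta_\alpha)$. This is true because

\begin{align*}
\ln(\hat{m}_n - \hat{\alpha}_1) &= \ln\left(\hat{\alpha}_2 \hat{\alpha}_3^{\log_2(n)}\right) \\
&= \log_2(n) \ln(\hat{\alpha}_3) + \ln(\hat{\alpha}_2) \\
&= \log_2(n) \ln(\alpha_3) + \ln(\alpha_2) \pm O(\ln(n)\Delta_\alpha) \\
&= \ln\left(\alpha_2 \alpha_3^{\log_2(n)}\right) \pm O(\ln(n)\Delta_\alpha) \\
&= \ln(m_n - \alpha_1) \pm O(\ln(n)\Delta_\alpha) .
\end{align*}

□

Setting $\Delta_\alpha < \Omega(\ln(n)^{-1}\epsilon)$ then guarantees $|\ln(\hat{m}_n - \alpha_1) - \ln(m_n - \alpha_1)| < \frac{7}{8}\epsilon$. By Claim 3 of the proof of Lemma 18 (which did not depend on the assumption $\xi < 0$), $\Delta_\alpha$ is $O(\Delta_m)$, so we require $P[\Delta_m < \Omega(\ln(n)^{-1}\epsilon)] \ge 1 - \delta$. Define $\Delta_j = |\hat{m}_j - m_j|$, so that $\Delta_m = \max_{j \in \{1, 2, 4\}} \Delta_j$. It suffices that $P[\Delta_j < \Omega(\ln(n)^{-1}\epsilon)] \ge 1 - \frac{\delta}{3}$ for $j \in \{1, 2, 4\}$. By Lemma 15, ensuring this requires $O\left(\ln\left(\frac{1}{\delta}\right) \frac{\ln(n)^2}{\epsilon^2}\right)$ draws from $G$.

□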
Lastly, the following theorem complements Theorem 31 by describing the behavior of $S_{GEV}$ when some arms have shape parameter $\xi > 0$.


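Theorem 33. Let $I = (G_1, G_2, \ldots, G_k)$ be an instance of the max k-armed bandit problem, where $G_i = GEV(\mu_i, \sigma_i, \xi_i)$, where $\xi_i > 0$ for some $i$. Let $S$ denote the strategy $S_{GEV}$, run with parameters $\epsilon = \sqrt[3]{\frac{k}{n}}$ and $\delta = \frac{1}{kn^2}$. Then

\[ \frac{E[M(I, S, n)] - \alpha_1}{m_n^* - \alpha_1} = 1 - O\left(\ln(nk) \ln(n)^2 \sqrt[3]{\frac{k}{n}}\right) \]

where $m_n^* = \max_{1 \le i \le k} m_n^i$ and $\alpha_1 = \max_{1 \le i \le k} \alpha_1^i$, where $\alpha_1^i = \mu_i - \frac{\sigma_i}{\xi_i}$.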


Proof. For the moment, let us assume that all arms have shape parameter $\xi > 0$. Let $A$ be the event (which occurs with probability at least $1 - k\delta$) that all estimates obtained in step 1(a) satisfy the inequality in Theorem 32.

To  ease  notation,  let  A  =  in  (nk)  ln(n)2-y/|,  and  let  rn3  =  rrf  denote  the  expected 
maximum  of  j  draws  from  the  arm  i  selected  for  exploitation. 

Claim  1.  To  prove  the  theorem,  it  suffices  to  show  that  A  implies  =  1  +  0(A). 

m n  —  tk  ^1  v  7 


Proof  of  claim  1.  Because  M(I,  S,n )  >  rhn-tk  and  the  event  A  occurs  with  probability 
at  least  1  —  k5,  it  suffices  to  show  that  A  implies 


(1  -  5k)mn-tk  -  an 

m*  —  oi\ 


1  —  0(A)  . 


Because  is  O  =  o(A),  it  suffices  to  show  that  A  implies 

rhn-tk  —  «i  „  _  .  .  . 

-GTAT - 1  =  1  -0(A)  . 

m*n  -  a1 

This  can  be  rewritten  as  m*  —  —  af)  =  {mn-tk  ~  «i)(l  +  0(A))  (we 

can  replace  1_^A)  with  1  +  0(A)  because  for  r  <  =  1  +  <  1  +  2r).  □ 

Claim  2.  ^  =  1  +  0(A). 

mn_tk-a  i  v  > 
Proof of claim 2. Using Proposition 3,


In 


mr 


0t\ 


TTln—tk  &1 


In  ( 

£  ( In  (// )  —  ln(n 


of) 

n 

0(A). 


tk)) 


The  claim  follows  from  the  fact  that  exp(/3)  <  1  +  |/3  for  (3  <  so  that  exp(0(A))  = 
1  +  0(A).  "  □ 

Claim  3.  A  implies  that  for  all  i. 


m\  —  ai 


<C  1  T  6  . 


Proof  of  claim  3.  By  definition,  oi\  =  a\—  (3  for  some  (3  >  0.  The  claim  follows  from  the 
fact  that  for  positive  N  and  D  and  (3  >  0,  ^  <  1  +  e  implies  A±|  <  1  +  e.  □ 

Claim  4.  A  implies  mn~ai  =  1  +  0(A). 

r  mn-tk-a  l  v  ' 

Proof  of  claim  4. 

m*n  —  ol\  m*  —  a±  m*  —  cti  mln  —  mln  —  ai 
-  ol\  fK  -  -  «i  rnf  -  -  ax 

<(l  +  e)-l-(l  +  £)-(l  +  0(A)) 

=  1  +  0(A) 

where  in  the  second  step  we  have  used  claims  2  and  3.  □ 

Putting  claims  1  and  4  together  completes  the  proof.  To  remove  the  assumption  that 
all  arms  have  f  >  0,  we  need  to  show  that  A  implies  that  for  n  sufficiently  large,  the 
arms  i  and  i*  (the  only  arms  that  play  a  role  in  the  proof)  will  have  shape  parameters  >  0. 
This  follows  from  the  fact  that  if  <  0,  mln  is  0(ln(n)),  while  if  f  >  f*  >  0,  mln  is 

O(n^).  □ 